[RFC,bpf-next,0/5] xdp: hints via kfuncs

Message ID 20221027200019.4106375-1-sdf@google.com (mailing list archive)

Message

Stanislav Fomichev Oct. 27, 2022, 8 p.m. UTC
This is an RFC for the alternative approach suggested by Martin and
Jakub. I've tried to CC most of the people from the original discussion;
feel free to add more if you think I've missed somebody.

Summary:
- add new BPF_F_XDP_HAS_METADATA program flag and abuse
  attr->prog_ifindex to pass target device ifindex at load time
- at load time, find appropriate ndo_unroll_kfunc and call
  it to unroll/inline kfuncs; kfuncs have the default "safe"
  implementation if unrolling is not supported by a particular
  device
- rewrite xskxceiver test to use C bpf program and extend
  it to export rx_timestamp (plus add rx timestamp to veth driver)

I've intentionally kept it small and hacky to see whether the approach is
workable or not.
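
For concreteness, here is a minimal sketch of the program side. The
kfunc name follows the veth patch in this series, but treat the exact
signature as provisional; the XSKMAP wiring is the usual AF_XDP
boilerplate:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} xsk SEC(".maps");

/* unrolled per-device at load time; falls back to the generic
 * implementation if the driver has no ndo_unroll_kfunc */
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx(struct xdp_md *ctx)
{
        void *data, *data_meta;
        __u64 *ts;

        /* reserve room in front of the packet for our metadata layout */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*ts)))
                return XDP_PASS;

        data = (void *)(long)ctx->data;
        data_meta = (void *)(long)ctx->data_meta;
        ts = data_meta;
        if ((void *)(ts + 1) > data)
                return XDP_PASS;

        *ts = bpf_xdp_metadata_rx_timestamp(ctx);
        return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
}

And the loader side, roughly (prog_ifindex in the open opts is what
patch 3 adds; BPF_F_XDP_HAS_METADATA comes from patch 1):

        LIBBPF_OPTS(bpf_object_open_opts, opts, .prog_ifindex = ifindex);
        struct bpf_object *obj = bpf_object__open_file("xsk.bpf.o", &opts);
        struct bpf_program *prog = bpf_object__find_program_by_name(obj, "rx");

        bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
        bpf_object__load(obj);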

Pros:
- we avoid BTF complexity; the BPF programs themselves are now responsible
  for agreeing on the metadata layout with the AF_XDP consumer
- the metadata is free if not used
- the metadata should, in theory, be cheap if used; kfuncs should be
  unrolled to the same code as if the metadata was pre-populated and
  passed with a BTF id
- it's not all or nothing; users can use a small subset of the metadata, which
  is more efficient than the BTF id approach, where all metadata has to be
  exposed for every frame (and selectively consumed by the users)

Cons:
- forwarding has to be handled explicitly; the BPF programs have to
  agree on the metadata layout (IOW, the forwarding program
  has to be aware of the final AF_XDP consumer metadata layout)
- TX picture is not clear; but it's not clear with BTF ids either;
  I think we've agreed that just reusing whatever we have at RX
  won't fly at TX; seems like a TX XDP program might be the answer
  here? (with another set of TX kfuncs to "expose" bpf/af_xdp metadata
  back into the kernel)

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (5):
  bpf: Support inlined/unrolled kfuncs for xdp metadata
  veth: Support rx timestamp metadata for xdp
  libbpf: Pass prog_ifindex via bpf_object_open_opts
  selftests/bpf: Convert xskxceiver to use custom program
  selftests/bpf: Test rx_timestamp metadata in xskxceiver

 drivers/net/veth.c                            |  31 +++++
 include/linux/bpf.h                           |   1 +
 include/linux/btf.h                           |   1 +
 include/linux/btf_ids.h                       |   4 +
 include/linux/netdevice.h                     |   3 +
 include/net/xdp.h                             |  22 ++++
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/syscall.c                          |  28 ++++-
 kernel/bpf/verifier.c                         |  60 +++++++++
 net/core/dev.c                                |   7 ++
 net/core/xdp.c                                |  28 +++++
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf.h                        |   6 +-
 tools/testing/selftests/bpf/Makefile          |   1 +
 .../testing/selftests/bpf/progs/xskxceiver.c  |  43 +++++++
 tools/testing/selftests/bpf/xskxceiver.c      | 119 +++++++++++++++---
 tools/testing/selftests/bpf/xskxceiver.h      |   5 +-
 18 files changed, 348 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c

Comments

John Fastabend Oct. 28, 2022, 3:58 p.m. UTC | #1
Stanislav Fomichev wrote:
> This is an RFC for the alternative approach suggested by Martin and
> Jakub. I've tried to CC most of the people from the original discussion,
> feel free to add more if you think I've missed somebody.
> 
> Summary:
> - add new BPF_F_XDP_HAS_METADATA program flag and abuse
>   attr->prog_ifindex to pass target device ifindex at load time
> - at load time, find appropriate ndo_unroll_kfunc and call
>   it to unroll/inline kfuncs; kfuncs have the default "safe"
>   implementation if unrolling is not supported by a particular
>   device
> - rewrite xskxceiver test to use C bpf program and extend
>   it to export rx_timestamp (plus add rx timestamp to veth driver)
> 
> I've intentionally kept it small and hacky to see whether the approach is
> workable or not.

Hi,

I need RX timestamps now as well, so I was going to work on some code
next week.

My plan was to simply put a kptr to the rx descriptor in the xdp
buffer. If I can read the rx descriptor I can read the timestamp,
the rxhash and any other metadata the NIC has written. All the
drivers I've looked at stash the data here.

I'll inline pros/cons compared to this below as I see them.

> 
> Pros:
> - we avoid BTF complexity; the BPF programs themselves are now responsible
>   for agreeing on the metadata layout with the AF_XDP consumer

Same: no BTF is needed on the kernel side. Userspace and BPF progs get
to sort it out.

> - the metadata is free if not used

Same.

> - the metadata should, in theory, be cheap if used; kfuncs should be
>   unrolled to the same code as if the metadata was pre-populated and
>   passed with a BTF id

Same; it's just a kptr at this point. One more advantage would
be the ability to read the data without copying it.

> - it's not all or nothing; users can use small subset of metadata which
>   is more efficient than the BTF id approach where all metadata has to be
>   exposed for every frame (and selectively consumed by the users)

Same.

> 
> Cons:
> - forwarding has to be handled explicitly; the BPF programs have to
>   agree on the metadata layout (IOW, the forwarding program
>   has to be aware of the final AF_XDP consumer metadata layout)

Same, although IMO this is a PRO. You only get the bytes you need
and care about, and can also augment them with extra derived data so
the calculations only happen once.

> - TX picture is not clear; but it's not clear with BTF ids as well;
>   I think we've agreed that just reusing whatever we have at RX
>   won't fly at TX; seems like TX XDP program might be the answer
>   here? (with a set of another tx kfuncs to "expose" bpf/af_xdp metatata
>   back into the kernel)


Agree TX is not addressed.


A bit of extra commentary. By exposing the raw kptr to the rx
descriptor we don't need driver writers to do anything, and we can
easily support all the drivers out of the gate with simple one- or
two-line changes. This pushes the interesting parts into userspace,
and BPF writers get to do the work without bothering driver folks;
if something isn't done today it doesn't matter, because user space
can come along and make it work later. So no scattered kernel
dependencies, which I really would like to avoid here. It's actually
very painful to have to support clusters with N kernels and M devices
if they have different features. Doable, but annoying, and much nicer
if we can just say 6.2 has support for kptr rx descriptor reading
and all XDP drivers support it, so timestamp and rxhash work
across the board.

To find the offset of fields (rxhash, timestamp) you can use
standard BTF relocations; we have all this machinery built up
already for all the other structs we read (net_devices, task
structs, inodes, ...), so it's not a big hurdle at all IMO. We
can add userspace libs if folks really care, but it's just
a read, so I'm not even sure that is helpful.
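
As a rough illustration (every type and accessor below is made up -
this is the shape of the idea, not an existing API), a CO-RE
relocated read from such a descriptor kptr could look like:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* pretend layout of one vendor's completion descriptor; the
 * attribute makes the field offsets relocatable via BTF */
struct nic_rx_desc {
        __u64 timestamp;
        __u32 rxhash;
} __attribute__((preserve_access_index));

/* hypothetical accessor handing out the descriptor kptr */
extern struct nic_rx_desc *bpf_xdp_get_rx_desc(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
        struct nic_rx_desc *desc = bpf_xdp_get_rx_desc(ctx);
        __u64 ts = 0;

        if (desc)
                ts = desc->timestamp; /* offset fixed up at load time */
        bpf_printk("rx ts %llu", ts);
        return XDP_PASS;
}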

I think it's nicer than having kfuncs that need to be written
everywhere. My $.02, although I'll poke around with the below
some as well. Feel free to just hang tight until I have some
code; at the moment I have Intel and Mellanox drivers that I
would want to support.

> 
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> 
> Stanislav Fomichev (5):
>   bpf: Support inlined/unrolled kfuncs for xdp metadata
>   veth: Support rx timestamp metadata for xdp
>   libbpf: Pass prog_ifindex via bpf_object_open_opts
>   selftests/bpf: Convert xskxceiver to use custom program
>   selftests/bpf: Test rx_timestamp metadata in xskxceiver
> 
>  drivers/net/veth.c                            |  31 +++++
>  include/linux/bpf.h                           |   1 +
>  include/linux/btf.h                           |   1 +
>  include/linux/btf_ids.h                       |   4 +
>  include/linux/netdevice.h                     |   3 +
>  include/net/xdp.h                             |  22 ++++
>  include/uapi/linux/bpf.h                      |   5 +
>  kernel/bpf/syscall.c                          |  28 ++++-
>  kernel/bpf/verifier.c                         |  60 +++++++++
>  net/core/dev.c                                |   7 ++
>  net/core/xdp.c                                |  28 +++++
>  tools/include/uapi/linux/bpf.h                |   5 +
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf.h                        |   6 +-
>  tools/testing/selftests/bpf/Makefile          |   1 +
>  .../testing/selftests/bpf/progs/xskxceiver.c  |  43 +++++++
>  tools/testing/selftests/bpf/xskxceiver.c      | 119 +++++++++++++++---
>  tools/testing/selftests/bpf/xskxceiver.h      |   5 +-
>  18 files changed, 348 insertions(+), 22 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c
> 
> -- 
> 2.38.1.273.g43a17bfeac-goog
>
Jakub Kicinski Oct. 28, 2022, 6:04 p.m. UTC | #2
On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> A bit of extra commentary. By exposing the raw kptr to the rx
> descriptor we don't need driver writers to do anything.
> And can easily support all the drivers out the gate with simple
> one or two line changes. This pushes the interesting parts
> into userspace and then BPF writers get to do the work without
> bother driver folks and also if its not done today it doesn't
> matter because user space can come along and make it work
> later. So no scattered kernel dependencies which I really
> would like to avoid here. Its actually very painful to have
> to support clusters with N kernels and M devices if they
> have different features. Doable but annoying and much nicer
> if we just say 6.2 has support for kptr rx descriptor reading
> and all XDP drivers support it. So timestamp, rxhash work
> across the board.

IMHO that's a bit of wishful thinking. Driver support is just a small
piece; you'll have different HW and FW versions, feature conflicts, etc.
In the end kernel version is just one variable and there are many others
you'll already have to track.

And it's actually harder to abstract away inter HW generation
differences if the user space code has to handle all of it.

> To find the offset of fields (rxhash, timestamp) you can use
> standard BTF relocations we have all this machinery built up
> already for all the other structs we read, net_devices, task
> structs, inodes, ... so its not a big hurdle at all IMO. We
> can add userspace libs if folks really care, but its just a read so
> I'm not even sure that is helpful.
> 
> I think its nicer than having kfuncs that need to be written
> everywhere. My $.02 although I'll poke around with below
> some as well. Feel free to just hang tight until I have some
> code at the moment I have intel, mellanox drivers that I
> would want to support.

I'd prefer if we left the door open for new vendors. Punting descriptor
parsing to user space will indeed result in what you just said - major
vendors are supported and that's it.
Stanislav Fomichev Oct. 28, 2022, 6:46 p.m. UTC | #3
On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> > A bit of extra commentary. By exposing the raw kptr to the rx
> > descriptor we don't need driver writers to do anything.
> > And can easily support all the drivers out the gate with simple
> > one or two line changes. This pushes the interesting parts
> > into userspace and then BPF writers get to do the work without
> > bother driver folks and also if its not done today it doesn't
> > matter because user space can come along and make it work
> > later. So no scattered kernel dependencies which I really
> > would like to avoid here. Its actually very painful to have
> > to support clusters with N kernels and M devices if they
> > have different features. Doable but annoying and much nicer
> > if we just say 6.2 has support for kptr rx descriptor reading
> > and all XDP drivers support it. So timestamp, rxhash work
> > across the board.
>
> IMHO that's a bit of wishful thinking. Driver support is just a small
> piece, you'll have different HW and FW versions, feature conflicts etc.
> In the end kernel version is just one variable and there are many others
> you'll already have to track.
>
> And it's actually harder to abstract away inter HW generation
> differences if the user space code has to handle all of it.

I've had the same concern:

Until we have some userspace library that abstracts all these details,
it's not really convenient to use. IIUC, with a kptr, I'd get a blob
of data and I need to go through the code and see what particular type
it represents for my particular device and how the data I need is
represented there. There are also these "if this is device v1 -> use
the v1 descriptor format; if it's a v2 -> use another struct; etc."
complexities that we'll be pushing onto the users. With kfuncs, we put
this burden on the driver developers, but I agree that the drawback
here is that we actually have to wait for the implementations to catch
up.
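
Concretely, every program would carry something like this (both
descriptor layouts are invented for illustration; FW-dependent
variants would still need a runtime check on top):

#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct nic_rx_desc_v1 {
        __u32 hw_ts_lo;
        __u32 hw_ts_hi;
} __attribute__((preserve_access_index));

struct nic_rx_desc_v2 {
        __u64 timestamp;
} __attribute__((preserve_access_index));

static __always_inline __u64 desc_timestamp(void *desc)
{
        /* CO-RE check: which descriptor type exists on this target */
        if (bpf_core_type_exists(struct nic_rx_desc_v2))
                return ((struct nic_rx_desc_v2 *)desc)->timestamp;
        return ((struct nic_rx_desc_v1 *)desc)->hw_ts_lo |
               ((__u64)((struct nic_rx_desc_v1 *)desc)->hw_ts_hi << 32);
}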

Jakub mentions FW and I haven't even thought about that; so yeah, bpf
programs might have to take a lot of other state into consideration
when parsing the descriptors; all those details do seem like they
belong in the driver code.

Feel free to send it early with just a handful of drivers implemented;
I'm more interested in the bpf/af_xdp/user api story; if we have some
nice sample/test case that shows how the metadata can be used, that
might push us closer to the agreement on the best way to proceed.



> > To find the offset of fields (rxhash, timestamp) you can use
> > standard BTF relocations we have all this machinery built up
> > already for all the other structs we read, net_devices, task
> > structs, inodes, ... so its not a big hurdle at all IMO. We
> > can add userspace libs if folks really care, but its just a read so
> > I'm not even sure that is helpful.
> >
> > I think its nicer than having kfuncs that need to be written
> > everywhere. My $.02 although I'll poke around with below
> > some as well. Feel free to just hang tight until I have some
> > code at the moment I have intel, mellanox drivers that I
> > would want to support.
>
> I'd prefer if we left the door open for new vendors. Punting descriptor
> parsing to user space will indeed result in what you just said - major
> vendors are supported and that's it.
John Fastabend Oct. 28, 2022, 11:16 p.m. UTC | #4
Stanislav Fomichev wrote:
> On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> > > A bit of extra commentary. By exposing the raw kptr to the rx
> > > descriptor we don't need driver writers to do anything.
> > > And can easily support all the drivers out the gate with simple
> > > one or two line changes. This pushes the interesting parts
> > > into userspace and then BPF writers get to do the work without
> > > bother driver folks and also if its not done today it doesn't
> > > matter because user space can come along and make it work
> > > later. So no scattered kernel dependencies which I really
> > > would like to avoid here. Its actually very painful to have
> > > to support clusters with N kernels and M devices if they
> > > have different features. Doable but annoying and much nicer
> > > if we just say 6.2 has support for kptr rx descriptor reading
> > > and all XDP drivers support it. So timestamp, rxhash work
> > > across the board.
> >
> > IMHO that's a bit of wishful thinking. Driver support is just a small
> > piece, you'll have different HW and FW versions, feature conflicts etc.
> > In the end kernel version is just one variable and there are many others
> > you'll already have to track.

Agree.

> >
> > And it's actually harder to abstract away inter HW generation
> > differences if the user space code has to handle all of it.

I don't see how it's any harder in practice though?

> 
> I've had the same concern:
> 
> Until we have some userspace library that abstracts all these details,
> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> of data and I need to go through the code and see what particular type
> it represents for my particular device and how the data I need is
> represented there. There are also these "if this is device v1 -> use
> v1 descriptor format; if it's a v2->use this another struct; etc"
> complexities that we'll be pushing onto the users. With kfuncs, we put
> this burden on the driver developers, but I agree that the drawback
> here is that we actually have to wait for the implementations to catch
> up.

I agree with everything there, you will get a blob of data and then
will need to know what field you want to read using BTF. But we
already do this for BPF programs all over the place, so it's not a big
lift for us. All other BPF tracing/observability requires the same
logic. I think for users of BPF in general, XDP/tc are perhaps the only
places left where you write BPF programs without thinking about BTF and
kernel data structures.

But, with the proposed kptr, the complexity lives in userspace and can be
fixed, added, updated without having to bother with kernel updates, etc.
From my point of view of supporting Cilium it's a win and much preferred
to having to deal with driver owners on all cloud vendors, distributions,
and so on.

If a vendor updates firmware with new fields I get those immediately.

> 
> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> programs might have to take a lot of other state into consideration
> when parsing the descriptors; all those details do seem like they
> belong to the driver code.

I would prefer to avoid being stuck on requiring driver writers to
be involved. With just a kptr I can support the device and any
firmware versions without requiring help.

> 
> Feel free to send it early with just a handful of drivers implemented;
> I'm more interested about bpf/af_xdp/user api story; if we have some
> nice sample/test case that shows how the metadata can be used, that
> might push us closer to the agreement on the best way to proceed.

I'll try to do an Intel and mlx implementation to get a cross section.
I have a good collection of NICs here, so I should be able to show a
couple of firmware versions. It could be fine I think to have the raw
kptr access and then also kfuncs for some things perhaps.


> 
> 
> 
> > > To find the offset of fields (rxhash, timestamp) you can use
> > > standard BTF relocations we have all this machinery built up
> > > already for all the other structs we read, net_devices, task
> > > structs, inodes, ... so its not a big hurdle at all IMO. We
> > > can add userspace libs if folks really care, but its just a read so
> > > I'm not even sure that is helpful.
> > >
> > > I think its nicer than having kfuncs that need to be written
> > > everywhere. My $.02 although I'll poke around with below
> > > some as well. Feel free to just hang tight until I have some
> > > code at the moment I have intel, mellanox drivers that I
> > > would want to support.
> >
> > I'd prefer if we left the door open for new vendors. Punting descriptor
> > parsing to user space will indeed result in what you just said - major
> > vendors are supported and that's it.

I'm not sure why it would make it harder for new vendors? I think
the opposite: it would be easier because I don't need vendor support
at all. Thinking it over, it seems there could be room for both.


Thanks!
Jakub Kicinski Oct. 29, 2022, 1:14 a.m. UTC | #5
On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > And it's actually harder to abstract away inter HW generation
> > > differences if the user space code has to handle all of it.  
> 
> I don't see how its any harder in practice though?

You need to find out what HW/FW/config you're running, right?
And all you have is a pointer to a blob of unknown type.

Take timestamps, for example: some NICs support adjusting the PHC 
or doing SW corrections (with different versions of hw/fw/server
platforms being capable of both/one/neither).

Sure you can extract all this info with tracing and careful
inspection via uAPI. But I don't think that's _easier_.
And the vendors can't run the results thru their validation 
(for whatever that's worth).

> > I've had the same concern:
> > 
> > Until we have some userspace library that abstracts all these details,
> > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > of data and I need to go through the code and see what particular type
> > it represents for my particular device and how the data I need is
> > represented there. There are also these "if this is device v1 -> use
> > v1 descriptor format; if it's a v2->use this another struct; etc"
> > complexities that we'll be pushing onto the users. With kfuncs, we put
> > this burden on the driver developers, but I agree that the drawback
> > here is that we actually have to wait for the implementations to catch
> > up.  
> 
> I agree with everything there, you will get a blob of data and then
> will need to know what field you want to read using BTF. But, we
> already do this for BPF programs all over the place so its not a big
> lift for us. All other BPF tracing/observability requires the same
> logic. I think users of BPF in general perhaps XDP/tc are the only
> place left to write BPF programs without thinking about BTF and
> kernel data structures.
> 
> But, with proposed kptr the complexity lives in userspace and can be
> fixed, added, updated without having to bother with kernel updates, etc.
> From my point of view of supporting Cilium its a win and much preferred
> to having to deal with driver owners on all cloud vendors, distributions,
> and so on.
> 
> If vendor updates firmware with new fields I get those immediately.

Conversely it's a valid concern that those who *do* actually update
their kernel regularly will have more things to worry about.

> > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > programs might have to take a lot of other state into consideration
> > when parsing the descriptors; all those details do seem like they
> > belong to the driver code.  
> 
> I would prefer to avoid being stuck on requiring driver writers to
> be involved. With just a kptr I can support the device and any
> firwmare versions without requiring help.

1) where are you getting all those HW / FW specs :S
2) maybe *you* can, but you're not exactly *not* an ex-driver developer :S

> > Feel free to send it early with just a handful of drivers implemented;
> > I'm more interested about bpf/af_xdp/user api story; if we have some
> > nice sample/test case that shows how the metadata can be used, that
> > might push us closer to the agreement on the best way to proceed.  
> 
> I'll try to do a intel and mlx implementation to get a cross section.
> I have a good collection of nics here so should be able to show a
> couple firmware versions. It could be fine I think to have the raw
> kptr access and then also kfuncs for some things perhaps.
> 
> > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > parsing to user space will indeed result in what you just said - major
> > > vendors are supported and that's it.  
> 
> I'm not sure about why it would make it harder for new vendors? I think
> the opposite, 

TBH I'm only replying to the email because of the above part :)
I thought this would be self evident, but I guess our perspectives 
are different.

Perhaps you look at it from the perspective of SW running on someone
else's cloud, and being able to move to another cloud, without having
to worry if feature X is available in xdp or just skb.

I look at it from the perspective of maintaining a cloud, with people
writing random XDP applications. If I swap a NIC from an incumbent to a
(superior) startup, and cloud users are messing with the raw descriptor -
I'd need to go find every XDP program out there and make sure it
understands the new descriptors.

There is a BPF foundation or whatnot now - what about starting a
certification program for cloud providers and making it clear what
features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

> it would be easier because I don't need vendor support at all.

Can you support the enfabrica NIC on day 1? :) To an extent, it's just
shifting the responsibility from the HW vendor to the middleware vendor.

> Thinking it over seems there could be room for both.

Are you thinking more or less Stan's proposal but with one of 
the callbacks being "give me the raw thing"? Probably as a ro dynptr?
Possible, but I don't think we need to hold off Stan's work.
Florian Bezdeka Oct. 31, 2022, 2:10 p.m. UTC | #6
Hi all,

I have been following this discussion closely for some time now. It
seems we have reached the point where it's getting interesting for me.

On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > > And it's actually harder to abstract away inter HW generation
> > > > differences if the user space code has to handle all of it.  
> > 
> > I don't see how its any harder in practice though?
> 
> You need to find out what HW/FW/config you're running, right?
> And all you have is a pointer to a blob of unknown type.
> 
> Take timestamps for example, some NICs support adjusting the PHC 
> or doing SW corrections (with different versions of hw/fw/server
> platforms being capable of both/one/neither).
> 
> Sure you can extract all this info with tracing and careful
> inspection via uAPI. But I don't think that's _easier_.
> And the vendors can't run the results thru their validation 
> (for whatever that's worth).
> 
> > > I've had the same concern:
> > > 
> > > Until we have some userspace library that abstracts all these details,
> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > > of data and I need to go through the code and see what particular type
> > > it represents for my particular device and how the data I need is
> > > represented there. There are also these "if this is device v1 -> use
> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> > > this burden on the driver developers, but I agree that the drawback
> > > here is that we actually have to wait for the implementations to catch
> > > up.  
> > 
> > I agree with everything there, you will get a blob of data and then
> > will need to know what field you want to read using BTF. But, we
> > already do this for BPF programs all over the place so its not a big
> > lift for us. All other BPF tracing/observability requires the same
> > logic. I think users of BPF in general perhaps XDP/tc are the only
> > place left to write BPF programs without thinking about BTF and
> > kernel data structures.
> > 
> > But, with proposed kptr the complexity lives in userspace and can be
> > fixed, added, updated without having to bother with kernel updates, etc.
> > From my point of view of supporting Cilium its a win and much preferred
> > to having to deal with driver owners on all cloud vendors, distributions,
> > and so on.
> > 
> > If vendor updates firmware with new fields I get those immediately.
> 
> Conversely it's a valid concern that those who *do* actually update
> their kernel regularly will have more things to worry about.
> 
> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > > programs might have to take a lot of other state into consideration
> > > when parsing the descriptors; all those details do seem like they
> > > belong to the driver code.  
> > 
> > I would prefer to avoid being stuck on requiring driver writers to
> > be involved. With just a kptr I can support the device and any
> > firwmare versions without requiring help.
> 
> 1) where are you getting all those HW / FW specs :S
> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> 
> > > Feel free to send it early with just a handful of drivers implemented;
> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> > > nice sample/test case that shows how the metadata can be used, that
> > > might push us closer to the agreement on the best way to proceed.  
> > 
> > I'll try to do a intel and mlx implementation to get a cross section.
> > I have a good collection of nics here so should be able to show a
> > couple firmware versions. It could be fine I think to have the raw
> > kptr access and then also kfuncs for some things perhaps.
> > 
> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > > parsing to user space will indeed result in what you just said - major
> > > > vendors are supported and that's it.  
> > 
> > I'm not sure about why it would make it harder for new vendors? I think
> > the opposite, 
> 
> TBH I'm only replying to the email because of the above part :)
> I thought this would be self evident, but I guess our perspectives 
> are different.
> 
> Perhaps you look at it from the perspective of SW running on someone
> else's cloud, an being able to move to another cloud, without having 
> to worry if feature X is available in xdp or just skb.
> 
> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with raw descriptor -
> I'd need to go find every XDP program out there and make sure it
> understands the new descriptors.

Here is another perspective:

As an AF_XDP application developer I don't want to deal with the
underlying hardware in detail. I'd like to request a feature from the OS
(in this case rx/tx timestamping). If the feature is available I will
simply use it; if not, I might have to work around it - maybe by falling
back to SW timestamping.

All parts of my application (BPF program included) should not be
optimized/adjusted for all the different HW variants out there.

My application might be run on bare metal/cloud/virtual systems. I do
not want to care about these scenarios differently.

I followed the idea of having a library for parsing the driver-specific
meta information. That would mean that this library has to be kept in sync
with the kernel, right? It doesn't help if a newer kernel provides XDP
hints support for more devices/drivers but the library is not updated.
That might be relevant for all the device update strategies out there.

In addition - and maybe even contrary to that - we care about zero copy (ZC)
support. Our current use case has to deal with a lot of small packets,
so we hope to benefit from that. If XDP hints support requires a copy
of the metadata - maybe to drive a HW-independent interface - that
might be a bottleneck for us.

> 
> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?
> 
> > it would be easier because I don't need vendor support at all.
> 
> Can you support the enfabrica NIC on day 1? :) To an extent, its just
> shifting the responsibility from the HW vendor to the middleware vendor.
> 
> > Thinking it over seems there could be room for both.
> 
> Are you thinking more or less Stan's proposal but with one of 
> the callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.
Toke Høiland-Jørgensen Oct. 31, 2022, 3:28 p.m. UTC | #7
"Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:

> Hi all,
>
> I was closely following this discussion for some time now. Seems we
> reached the point where it's getting interesting for me.
>
> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>> > > > And it's actually harder to abstract away inter HW generation
>> > > > differences if the user space code has to handle all of it.  
>> > 
>> > I don't see how its any harder in practice though?
>> 
>> You need to find out what HW/FW/config you're running, right?
>> And all you have is a pointer to a blob of unknown type.
>> 
>> Take timestamps for example, some NICs support adjusting the PHC 
>> or doing SW corrections (with different versions of hw/fw/server
>> platforms being capable of both/one/neither).
>> 
>> Sure you can extract all this info with tracing and careful
>> inspection via uAPI. But I don't think that's _easier_.
>> And the vendors can't run the results thru their validation 
>> (for whatever that's worth).
>> 
>> > > I've had the same concern:
>> > > 
>> > > Until we have some userspace library that abstracts all these details,
>> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>> > > of data and I need to go through the code and see what particular type
>> > > it represents for my particular device and how the data I need is
>> > > represented there. There are also these "if this is device v1 -> use
>> > > v1 descriptor format; if it's a v2->use this another struct; etc"
>> > > complexities that we'll be pushing onto the users. With kfuncs, we put
>> > > this burden on the driver developers, but I agree that the drawback
>> > > here is that we actually have to wait for the implementations to catch
>> > > up.  
>> > 
>> > I agree with everything there, you will get a blob of data and then
>> > will need to know what field you want to read using BTF. But, we
>> > already do this for BPF programs all over the place so its not a big
>> > lift for us. All other BPF tracing/observability requires the same
>> > logic. I think users of BPF in general perhaps XDP/tc are the only
>> > place left to write BPF programs without thinking about BTF and
>> > kernel data structures.
>> > 
>> > But, with proposed kptr the complexity lives in userspace and can be
>> > fixed, added, updated without having to bother with kernel updates, etc.
>> > From my point of view of supporting Cilium its a win and much preferred
>> > to having to deal with driver owners on all cloud vendors, distributions,
>> > and so on.
>> > 
>> > If vendor updates firmware with new fields I get those immediately.
>> 
>> Conversely it's a valid concern that those who *do* actually update
>> their kernel regularly will have more things to worry about.
>> 
>> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>> > > programs might have to take a lot of other state into consideration
>> > > when parsing the descriptors; all those details do seem like they
>> > > belong to the driver code.  
>> > 
>> > I would prefer to avoid being stuck on requiring driver writers to
>> > be involved. With just a kptr I can support the device and any
>> > firwmare versions without requiring help.
>> 
>> 1) where are you getting all those HW / FW specs :S
>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>> 
>> > > Feel free to send it early with just a handful of drivers implemented;
>> > > I'm more interested about bpf/af_xdp/user api story; if we have some
>> > > nice sample/test case that shows how the metadata can be used, that
>> > > might push us closer to the agreement on the best way to proceed.  
>> > 
>> > I'll try to do a intel and mlx implementation to get a cross section.
>> > I have a good collection of nics here so should be able to show a
>> > couple firmware versions. It could be fine I think to have the raw
>> > kptr access and then also kfuncs for some things perhaps.
>> > 
>> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
>> > > > parsing to user space will indeed result in what you just said - major
>> > > > vendors are supported and that's it.  
>> > 
>> > I'm not sure about why it would make it harder for new vendors? I think
>> > the opposite, 
>> 
>> TBH I'm only replying to the email because of the above part :)
>> I thought this would be self evident, but I guess our perspectives 
>> are different.
>> 
>> Perhaps you look at it from the perspective of SW running on someone
>> else's cloud, an being able to move to another cloud, without having 
>> to worry if feature X is available in xdp or just skb.
>> 
>> I look at it from the perspective of maintaining a cloud, with people
>> writing random XDP applications. If I swap a NIC from an incumbent to a
>> (superior) startup, and cloud users are messing with raw descriptor -
>> I'd need to go find every XDP program out there and make sure it
>> understands the new descriptors.
>
> Here is another perspective:
>
> As AF_XDP application developer I don't wan't to deal with the
> underlying hardware in detail. I like to request a feature from the OS
> (in this case rx/tx timestamping). If the feature is available I will
> simply use it, if not I might have to work around it - maybe by falling
> back to SW timestamping.
>
> All parts of my application (BPF program included) should not be
> optimized/adjusted for all the different HW variants out there.

Yes, absolutely agreed. Abstracting away those kinds of hardware
differences is the whole *point* of having an OS/driver model. I.e.,
it's what the kernel is there for! If people want to bypass that and get
direct access to the hardware, they can already do that by using DPDK.

So in other words, 100% agreed that we should not expect the BPF
developers to deal with hardware details as would be required with a
kptr-based interface.

As for the kfunc-based interface, I think it shows some promise.
Exposing a list of function names to retrieve individual metadata items
instead of a struct layout is sorta comparable in terms of developer UI
accessibility etc (IMO).

There are three main drawbacks, AFAICT:

1. It requires driver developers to write and maintain the code that
generates the unrolled BPF bytecode to access the metadata fields, which
is a non-trivial amount of complexity. Maybe this can be abstracted away
with some internal helpers though (like, e.g., a
bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
the required JMP/MOV/LDX instructions?) A rough sketch of such a helper
follows after point 3 below.

2. AF_XDP programs won't be able to access the metadata without using a
custom XDP program that calls the kfuncs and puts the data into the
metadata area. We could solve this with some code in libxdp, though; if
this code can be made generic enough (so it just dumps the available
metadata functions from the running kernel at load time), it should be
forward-compatible with new versions of the kernel that add new fields,
which should alleviate Florian's concern about keeping things in sync.

3. It will make it harder to consume the metadata when building SKBs. I
think the CPUMAP and veth use cases are also quite important, and that
we want metadata to be available for building SKBs in this path. Maybe
this can be resolved by having a convenient kfunc for this that can be
used for programs doing such redirects. E.g., you could just call
xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
would recursively expand into all the kfunc calls needed to extract the
metadata supported by the SKB path?
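
To make the helper idea in point 1 concrete (kernel-side, purely
illustrative - the instruction-emission API a driver's unroll callback
gets is exactly what the RFC still has to settle):

#include <linux/filter.h>

/* emit "dst_reg = *(u64 *)(ptr_reg + offset)", i.e. replace the
 * kfunc call with a direct load from the descriptor; returns the
 * number of instructions written */
static int bpf_xdp_metadata_copy_u64(struct bpf_insn *insn,
                                     int dst_reg, int ptr_reg, int offset)
{
        insn[0] = BPF_LDX_MEM(BPF_DW, dst_reg, ptr_reg, offset);
        return 1;
}

/* a driver's ndo_unroll_kfunc could then emit, assuming it has
 * already arranged for ptr_reg to point at its descriptor:
 *
 *      n = bpf_xdp_metadata_copy_u64(insn, BPF_REG_0, BPF_REG_1,
 *                                    offsetof(struct my_rx_desc, ts));
 */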

-Toke
Stanislav Fomichev Oct. 31, 2022, 5 p.m. UTC | #8
On Mon, Oct 31, 2022 at 8:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>
> > Hi all,
> >
> > I was closely following this discussion for some time now. Seems we
> > reached the point where it's getting interesting for me.
> >
> > On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >> > > > And it's actually harder to abstract away inter HW generation
> >> > > > differences if the user space code has to handle all of it.
> >> >
> >> > I don't see how its any harder in practice though?
> >>
> >> You need to find out what HW/FW/config you're running, right?
> >> And all you have is a pointer to a blob of unknown type.
> >>
> >> Take timestamps for example, some NICs support adjusting the PHC
> >> or doing SW corrections (with different versions of hw/fw/server
> >> platforms being capable of both/one/neither).
> >>
> >> Sure you can extract all this info with tracing and careful
> >> inspection via uAPI. But I don't think that's _easier_.
> >> And the vendors can't run the results thru their validation
> >> (for whatever that's worth).
> >>
> >> > > I've had the same concern:
> >> > >
> >> > > Until we have some userspace library that abstracts all these details,
> >> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >> > > of data and I need to go through the code and see what particular type
> >> > > it represents for my particular device and how the data I need is
> >> > > represented there. There are also these "if this is device v1 -> use
> >> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> >> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> >> > > this burden on the driver developers, but I agree that the drawback
> >> > > here is that we actually have to wait for the implementations to catch
> >> > > up.
> >> >
> >> > I agree with everything there, you will get a blob of data and then
> >> > will need to know what field you want to read using BTF. But, we
> >> > already do this for BPF programs all over the place so its not a big
> >> > lift for us. All other BPF tracing/observability requires the same
> >> > logic. I think users of BPF in general perhaps XDP/tc are the only
> >> > place left to write BPF programs without thinking about BTF and
> >> > kernel data structures.
> >> >
> >> > But, with proposed kptr the complexity lives in userspace and can be
> >> > fixed, added, updated without having to bother with kernel updates, etc.
> >> > From my point of view of supporting Cilium its a win and much preferred
> >> > to having to deal with driver owners on all cloud vendors, distributions,
> >> > and so on.
> >> >
> >> > If vendor updates firmware with new fields I get those immediately.
> >>
> >> Conversely it's a valid concern that those who *do* actually update
> >> their kernel regularly will have more things to worry about.
> >>
> >> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >> > > programs might have to take a lot of other state into consideration
> >> > > when parsing the descriptors; all those details do seem like they
> >> > > belong to the driver code.
> >> >
> >> > I would prefer to avoid being stuck on requiring driver writers to
> >> > be involved. With just a kptr I can support the device and any
> >> > firwmare versions without requiring help.
> >>
> >> 1) where are you getting all those HW / FW specs :S
> >> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>
> >> > > Feel free to send it early with just a handful of drivers implemented;
> >> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> >> > > nice sample/test case that shows how the metadata can be used, that
> >> > > might push us closer to the agreement on the best way to proceed.
> >> >
> >> > I'll try to do a intel and mlx implementation to get a cross section.
> >> > I have a good collection of nics here so should be able to show a
> >> > couple firmware versions. It could be fine I think to have the raw
> >> > kptr access and then also kfuncs for some things perhaps.
> >> >
> >> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> >> > > > parsing to user space will indeed result in what you just said - major
> >> > > > vendors are supported and that's it.
> >> >
> >> > I'm not sure about why it would make it harder for new vendors? I think
> >> > the opposite,
> >>
> >> TBH I'm only replying to the email because of the above part :)
> >> I thought this would be self evident, but I guess our perspectives
> >> are different.
> >>
> >> Perhaps you look at it from the perspective of SW running on someone
> >> else's cloud, an being able to move to another cloud, without having
> >> to worry if feature X is available in xdp or just skb.
> >>
> >> I look at it from the perspective of maintaining a cloud, with people
> >> writing random XDP applications. If I swap a NIC from an incumbent to a
> >> (superior) startup, and cloud users are messing with raw descriptor -
> >> I'd need to go find every XDP program out there and make sure it
> >> understands the new descriptors.
> >
> > Here is another perspective:
> >
> > As AF_XDP application developer I don't wan't to deal with the
> > underlying hardware in detail. I like to request a feature from the OS
> > (in this case rx/tx timestamping). If the feature is available I will
> > simply use it, if not I might have to work around it - maybe by falling
> > back to SW timestamping.
> >
> > All parts of my application (BPF program included) should not be
> > optimized/adjusted for all the different HW variants out there.
>
> Yes, absolutely agreed. Abstracting away those kinds of hardware
> differences is the whole *point* of having an OS/driver model. I.e.,
> it's what the kernel is there for! If people want to bypass that and get
> direct access to the hardware, they can already do that by using DPDK.
>
> So in other words, 100% agreed that we should not expect the BPF
> developers to deal with hardware details as would be required with a
> kptr-based interface.
>
> As for the kfunc-based interface, I think it shows some promise.
> Exposing a list of function names to retrieve individual metadata items
> instead of a struct layout is sorta comparable in terms of developer UI
> accessibility etc (IMO).
>
> There are three main drawbacks, AFAICT:
>
> 1. It requires driver developers to write and maintain the code that
> generates the unrolled BPF bytecode to access the metadata fields, which
> is a non-trivial amount of complexity. Maybe this can be abstracted away
> with some internal helpers though (like, e.g., a
> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> the required JMP/MOV/LDX instructions?

Right, I hope we can have some helpers to abstract the raw instructions.
I might need to try to implement the actual metadata fetching for some
real devices and see how well it works in practice.

> 2. AF_XDP programs won't be able to access the metadata without using a
> custom XDP program that calls the kfuncs and puts the data into the
> metadata area. We could solve this with some code in libxdp, though; if
> this code can be made generic enough (so it just dumps the available
> metadata functions from the running kernel at load time), it may be
> possible to make it generic enough that it will be forward-compatible
> with new versions of the kernel that add new fields, which should
> alleviate Florian's concern about keeping things in sync.

Good point. I had to convert to a custom program to use the kfuncs :-(
But your suggestion sounds good; maybe libxdp can accept some extra
info about the offset at which the user would like to place the
metadata, and the library can generate the required bytecode?

> 3. It will make it harder to consume the metadata when building SKBs. I
> think the CPUMAP and veth use cases are also quite important, and that
> we want metadata to be available for building SKBs in this path. Maybe
> this can be resolved by having a convenient kfunc for this that can be
> used for programs doing such redirects. E.g., you could just call
> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> would recursively expand into all the kfunc calls needed to extract the
> metadata supported by the SKB path?

So this xdp_copy_metadata_for_skb will create a metadata layout that
the kernel will be able to understand when converting back to skb?
IIUC, the xdp program will look something like the following:

if (xdp packet is to be consumed by af_xdp) {
  // do a bunch of bpf_xdp_metadata_<metadata> calls
  // and assemble your own metadata layout
  return bpf_redirect_map(xsk, ...);
} else {
  // if the packet is to be consumed by the kernel
  xdp_copy_metadata_for_skb(ctx);
  return bpf_redirect(...);
}

Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
put some magic number in the first byte(s) of the metadata so the
kernel can check whether xdp_copy_metadata_for_skb has been called
previously (or maybe xdp_frame can carry this extra signal, idk).
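
For the "agree on the layout" part, the contract between the XDP
program and the AF_XDP consumer could be as simple as a shared header.
A sketch (the struct, magic value and helper names are all made up):

/* shared between the BPF program and the userspace consumer */
#define XSK_META_MAGIC 0x584d4554 /* arbitrary */

struct xsk_meta {
        __u32 magic;
        __u64 rx_timestamp;
};

/* userspace rx path: the metadata sits immediately in front of the
 * frame, i.e. at desc->addr - sizeof(struct xsk_meta) in the umem
 * (desc is the struct xdp_desc pulled from the AF_XDP rx ring) */
const struct xsk_meta *meta =
        (const void *)((char *)umem_area + desc->addr - sizeof(*meta));

if (meta->magic == XSK_META_MAGIC)
        handle_rx_timestamp(meta->rx_timestamp);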
John Fastabend Oct. 31, 2022, 5:01 p.m. UTC | #9
Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > > And it's actually harder to abstract away inter HW generation
> > > > differences if the user space code has to handle all of it.  
> > 
> > I don't see how its any harder in practice though?
> 
> You need to find out what HW/FW/config you're running, right?
> And all you have is a pointer to a blob of unknown type.

Yep. I guess I'm in the position of already having to do this
somewhat to collect stats from the device. Although it's maybe
a bit more involved here with vendors that are versioning the
descriptors.

Also, nit: it's not an unknown type; we know the full type via BTF.

> 
> Take timestamps for example, some NICs support adjusting the PHC 
> or doing SW corrections (with different versions of hw/fw/server
> platforms being capable of both/one/neither).

It's worse, actually.

Having started to do this timestamping, it is not at all consistent
across NICs, so I think we are stuck having to know hw specifics
here regardless. Also some NICs will timestamp all RX pkts, some
only specific pkts, and some require configuration to decide which
mode to run in, and so on. You then end up with a matrix of features
supported by hw/fw/sw and desired state, and I can't see any way
around this.

> 
> Sure you can extract all this info with tracing and careful
> inspection via uAPI. But I don't think that's _easier_.
> And the vendors can't run the results thru their validation 
> (for whatever that's worth).

I think you hit our point-of-view differences below. See, I
don't want to depend on the vendor. I want access to the
fields; otherwise I'm stuck working with vendors on their
time frames. You have the other perspective of supporting the
NIC and the ability to update kernels, whereas I still live with
4.18/4.19 kernels (even 4.14 sometimes). So what we land now
still needs to work in 5 years.

> 
> > > I've had the same concern:
> > > 
> > > Until we have some userspace library that abstracts all these details,
> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > > of data and I need to go through the code and see what particular type
> > > it represents for my particular device and how the data I need is
> > > represented there. There are also these "if this is device v1 -> use
> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> > > this burden on the driver developers, but I agree that the drawback
> > > here is that we actually have to wait for the implementations to catch
> > > up.  
> > 
> > I agree with everything there, you will get a blob of data and then
> > will need to know what field you want to read using BTF. But, we
> > already do this for BPF programs all over the place so its not a big
> > lift for us. All other BPF tracing/observability requires the same
> > logic. I think users of BPF in general perhaps XDP/tc are the only
> > place left to write BPF programs without thinking about BTF and
> > kernel data structures.
> > 
> > But, with proposed kptr the complexity lives in userspace and can be
> > fixed, added, updated without having to bother with kernel updates, etc.
> > From my point of view of supporting Cilium its a win and much preferred
> > to having to deal with driver owners on all cloud vendors, distributions,
> > and so on.
> > 
> > If vendor updates firmware with new fields I get those immediately.
> 
> Conversely it's a valid concern that those who *do* actually update
> their kernel regularly will have more things to worry about.

I'm not sure a kptr_func is any harder to write than a user space
relocation for that func? In tetragon and cilium we've done userspace
rewrites for some time. Happy to generalize that infra into the kernel
repo if that helps. IMO having a libhw.h in the kernel tree's
./tools/bpf directory would work.

> 
> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > > programs might have to take a lot of other state into consideration
> > > when parsing the descriptors; all those details do seem like they
> > > belong to the driver code.  
> > 
> > I would prefer to avoid being stuck on requiring driver writers to
> > be involved. With just a kptr I can support the device and any
> > firwmare versions without requiring help.
> 
> 1) where are you getting all those HW / FW specs :S

Most are public docs; of course vendors have internal docs with more
details, but what can you do. Also, the source code has the structs.

> 2) maybe *you* can but you're not exactly not an ex-driver developer :S

Sure :) but if we put a libhw.h file in the kernel and test with
selftests (which will be hard without hardware), then not everyone
needs to be a driver internals expert.

> 
> > > Feel free to send it early with just a handful of drivers implemented;
> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> > > nice sample/test case that shows how the metadata can be used, that
> > > might push us closer to the agreement on the best way to proceed.  
> > 
> > I'll try to do a intel and mlx implementation to get a cross section.
> > I have a good collection of nics here so should be able to show a
> > couple firmware versions. It could be fine I think to have the raw
> > kptr access and then also kfuncs for some things perhaps.
> > 
> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > > parsing to user space will indeed result in what you just said - major
> > > > vendors are supported and that's it.  
> > 
> > I'm not sure about why it would make it harder for new vendors? I think
> > the opposite, 
> 
> TBH I'm only replying to the email because of the above part :)
> I thought this would be self evident, but I guess our perspectives 
> are different.

Yep.

> 
> Perhaps you look at it from the perspective of SW running on someone
> else's cloud, an being able to move to another cloud, without having 
> to worry if feature X is available in xdp or just skb.

Exactly. I have SW running in a data center or cloud for a security
team or ops team, and usually they don't own the platform. Anyway,
the platform team is most likely going to stay on an LTS kernel for
at least a year or two. I maintain the SW, and some 3rd-party
NIC vendor may not even know about my SW (cilium/tetragon).

> 
> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with raw descriptor -
> I'd need to go find every XDP program out there and make sure it
> understands the new descriptors.

I get it. It's interesting that you wouldn't tell the XDP programmers
to deal with it, which is my case. My $.02 is that a userspace lib
could abstract this more easily than a kernel func, and also add new
features without rolling new kernels.

> 
> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

Maybe, but you're still stuck on kernel versions.

> 
> > it would be easier because I don't need vendor support at all.
> 
> Can you support the enfabrica NIC on day 1? :) To an extent, its just
> shifting the responsibility from the HW vendor to the middleware vendor.

Yep. With the important detail that I can run new features on old
kernels even if the vendor didn't add rxhash or timestamp out of the
gate.

Nothing stops the HW vendor from contributing the features to this
in-kernel-source BPF lib, though.

> 
> > Thinking it over seems there could be room for both.
> 
> Are you thinking more or less Stan's proposal but with one of 
> the callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.

Yeah, that was my thinking. Both could coexist. OTOH, doing it on the
BPF program lib side seems cleaner to me for kernel maintenance and
hardware support. We've had trouble getting drivers to support XDP
features, so adding more requirements for entry seems problematic to
me when we can avoid it.
Yonghong Song Oct. 31, 2022, 7:36 p.m. UTC | #10
On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> 
>> Hi all,
>>
>> I was closely following this discussion for some time now. Seems we
>> reached the point where it's getting interesting for me.
>>
>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>> differences if the user space code has to handle all of it.
>>>>
>>>> I don't see how its any harder in practice though?
>>>
>>> You need to find out what HW/FW/config you're running, right?
>>> And all you have is a pointer to a blob of unknown type.
>>>
>>> Take timestamps for example, some NICs support adjusting the PHC
>>> or doing SW corrections (with different versions of hw/fw/server
>>> platforms being capable of both/one/neither).
>>>
>>> Sure you can extract all this info with tracing and careful
>>> inspection via uAPI. But I don't think that's _easier_.
>>> And the vendors can't run the results thru their validation
>>> (for whatever that's worth).
>>>
>>>>> I've had the same concern:
>>>>>
>>>>> Until we have some userspace library that abstracts all these details,
>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>> of data and I need to go through the code and see what particular type
>>>>> it represents for my particular device and how the data I need is
>>>>> represented there. There are also these "if this is device v1 -> use
>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>> this burden on the driver developers, but I agree that the drawback
>>>>> here is that we actually have to wait for the implementations to catch
>>>>> up.
>>>>
>>>> I agree with everything there, you will get a blob of data and then
>>>> will need to know what field you want to read using BTF. But, we
>>>> already do this for BPF programs all over the place so its not a big
>>>> lift for us. All other BPF tracing/observability requires the same
>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>> place left to write BPF programs without thinking about BTF and
>>>> kernel data structures.
>>>>
>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>  From my point of view of supporting Cilium its a win and much preferred
>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>> and so on.
>>>>
>>>> If vendor updates firmware with new fields I get those immediately.
>>>
>>> Conversely it's a valid concern that those who *do* actually update
>>> their kernel regularly will have more things to worry about.
>>>
>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>> programs might have to take a lot of other state into consideration
>>>>> when parsing the descriptors; all those details do seem like they
>>>>> belong to the driver code.
>>>>
>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>> be involved. With just a kptr I can support the device and any
>>>> firwmare versions without requiring help.
>>>
>>> 1) where are you getting all those HW / FW specs :S
>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>
>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>> might push us closer to the agreement on the best way to proceed.
>>>>
>>>> I'll try to do a intel and mlx implementation to get a cross section.
>>>> I have a good collection of nics here so should be able to show a
>>>> couple firmware versions. It could be fine I think to have the raw
>>>> kptr access and then also kfuncs for some things perhaps.
>>>>
>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>> vendors are supported and that's it.
>>>>
>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>> the opposite,
>>>
>>> TBH I'm only replying to the email because of the above part :)
>>> I thought this would be self evident, but I guess our perspectives
>>> are different.
>>>
>>> Perhaps you look at it from the perspective of SW running on someone
>>> else's cloud, an being able to move to another cloud, without having
>>> to worry if feature X is available in xdp or just skb.
>>>
>>> I look at it from the perspective of maintaining a cloud, with people
>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>> (superior) startup, and cloud users are messing with raw descriptor -
>>> I'd need to go find every XDP program out there and make sure it
>>> understands the new descriptors.
>>
>> Here is another perspective:
>>
>> As AF_XDP application developer I don't wan't to deal with the
>> underlying hardware in detail. I like to request a feature from the OS
>> (in this case rx/tx timestamping). If the feature is available I will
>> simply use it, if not I might have to work around it - maybe by falling
>> back to SW timestamping.
>>
>> All parts of my application (BPF program included) should not be
>> optimized/adjusted for all the different HW variants out there.
> 
> Yes, absolutely agreed. Abstracting away those kinds of hardware
> differences is the whole *point* of having an OS/driver model. I.e.,
> it's what the kernel is there for! If people want to bypass that and get
> direct access to the hardware, they can already do that by using DPDK.
> 
> So in other words, 100% agreed that we should not expect the BPF
> developers to deal with hardware details as would be required with a
> kptr-based interface.
> 
> As for the kfunc-based interface, I think it shows some promise.
> Exposing a list of function names to retrieve individual metadata items
> instead of a struct layout is sorta comparable in terms of developer UI
> accessibility etc (IMO).

Looks like there are quite a few use cases for hw_timestamp.
Do you think we could add it to a uapi struct like xdp_md?

The following is the current xdp_md:
struct xdp_md {
         __u32 data;
         __u32 data_end;
         __u32 data_meta;
         /* Below access go through struct xdp_rxq_info */
         __u32 ingress_ifindex; /* rxq->dev->ifindex */
         __u32 rx_queue_index;  /* rxq->queue_index  */

         __u32 egress_ifindex;  /* txq->dev->ifindex */
};

We could add a __u64 hw_timestamp to xdp_md so the user can just do
xdp_md->hw_timestamp to get the value; xdp_md->hw_timestamp == 0 would
mean the hw_timestamp is not available.

Inside the kernel, the ctx rewriter can generate code to call a
driver-specific function to retrieve the data.
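
From the program side, consumption would then be trivial. A sketch,
assuming the proposed (not yet existing) hw_timestamp field:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int rx_ts_example(struct xdp_md *ctx)
{
	/* hw_timestamp is the field proposed above; 0 means the
	 * driver/HW did not provide one
	 */
	if (ctx->hw_timestamp)
		bpf_printk("rx hw timestamp: %llu", ctx->hw_timestamp);

	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";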

The kfunc approach can then be used for *less* common use cases?

> 
> There are three main drawbacks, AFAICT:
> 
> 1. It requires driver developers to write and maintain the code that
> generates the unrolled BPF bytecode to access the metadata fields, which
> is a non-trivial amount of complexity. Maybe this can be abstracted away
> with some internal helpers though (like, e.g., a
> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> the required JMP/MOV/LDX instructions?
> 
> 2. AF_XDP programs won't be able to access the metadata without using a
> custom XDP program that calls the kfuncs and puts the data into the
> metadata area. We could solve this with some code in libxdp, though; if
> this code can be made generic enough (so it just dumps the available
> metadata functions from the running kernel at load time), it may be
> possible to make it generic enough that it will be forward-compatible
> with new versions of the kernel that add new fields, which should
> alleviate Florian's concern about keeping things in sync.
> 
> 3. It will make it harder to consume the metadata when building SKBs. I
> think the CPUMAP and veth use cases are also quite important, and that
> we want metadata to be available for building SKBs in this path. Maybe
> this can be resolved by having a convenient kfunc for this that can be
> used for programs doing such redirects. E.g., you could just call
> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> would recursively expand into all the kfunc calls needed to extract the
> metadata supported by the SKB path?
> 
> -Toke
>
Stanislav Fomichev Oct. 31, 2022, 10:09 p.m. UTC | #11
On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> > "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >
> >> Hi all,
> >>
> >> I was closely following this discussion for some time now. Seems we
> >> reached the point where it's getting interesting for me.
> >>
> >> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>> differences if the user space code has to handle all of it.
> >>>>
> >>>> I don't see how its any harder in practice though?
> >>>
> >>> You need to find out what HW/FW/config you're running, right?
> >>> And all you have is a pointer to a blob of unknown type.
> >>>
> >>> Take timestamps for example, some NICs support adjusting the PHC
> >>> or doing SW corrections (with different versions of hw/fw/server
> >>> platforms being capable of both/one/neither).
> >>>
> >>> Sure you can extract all this info with tracing and careful
> >>> inspection via uAPI. But I don't think that's _easier_.
> >>> And the vendors can't run the results thru their validation
> >>> (for whatever that's worth).
> >>>
> >>>>> I've had the same concern:
> >>>>>
> >>>>> Until we have some userspace library that abstracts all these details,
> >>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>> of data and I need to go through the code and see what particular type
> >>>>> it represents for my particular device and how the data I need is
> >>>>> represented there. There are also these "if this is device v1 -> use
> >>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>> this burden on the driver developers, but I agree that the drawback
> >>>>> here is that we actually have to wait for the implementations to catch
> >>>>> up.
> >>>>
> >>>> I agree with everything there, you will get a blob of data and then
> >>>> will need to know what field you want to read using BTF. But, we
> >>>> already do this for BPF programs all over the place so its not a big
> >>>> lift for us. All other BPF tracing/observability requires the same
> >>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>> place left to write BPF programs without thinking about BTF and
> >>>> kernel data structures.
> >>>>
> >>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>  From my point of view of supporting Cilium its a win and much preferred
> >>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>> and so on.
> >>>>
> >>>> If vendor updates firmware with new fields I get those immediately.
> >>>
> >>> Conversely it's a valid concern that those who *do* actually update
> >>> their kernel regularly will have more things to worry about.
> >>>
> >>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>> programs might have to take a lot of other state into consideration
> >>>>> when parsing the descriptors; all those details do seem like they
> >>>>> belong to the driver code.
> >>>>
> >>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>> be involved. With just a kptr I can support the device and any
> >>>> firwmare versions without requiring help.
> >>>
> >>> 1) where are you getting all those HW / FW specs :S
> >>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>
> >>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>> might push us closer to the agreement on the best way to proceed.
> >>>>
> >>>> I'll try to do a intel and mlx implementation to get a cross section.
> >>>> I have a good collection of nics here so should be able to show a
> >>>> couple firmware versions. It could be fine I think to have the raw
> >>>> kptr access and then also kfuncs for some things perhaps.
> >>>>
> >>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>> vendors are supported and that's it.
> >>>>
> >>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>> the opposite,
> >>>
> >>> TBH I'm only replying to the email because of the above part :)
> >>> I thought this would be self evident, but I guess our perspectives
> >>> are different.
> >>>
> >>> Perhaps you look at it from the perspective of SW running on someone
> >>> else's cloud, an being able to move to another cloud, without having
> >>> to worry if feature X is available in xdp or just skb.
> >>>
> >>> I look at it from the perspective of maintaining a cloud, with people
> >>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>> (superior) startup, and cloud users are messing with raw descriptor -
> >>> I'd need to go find every XDP program out there and make sure it
> >>> understands the new descriptors.
> >>
> >> Here is another perspective:
> >>
> >> As AF_XDP application developer I don't wan't to deal with the
> >> underlying hardware in detail. I like to request a feature from the OS
> >> (in this case rx/tx timestamping). If the feature is available I will
> >> simply use it, if not I might have to work around it - maybe by falling
> >> back to SW timestamping.
> >>
> >> All parts of my application (BPF program included) should not be
> >> optimized/adjusted for all the different HW variants out there.
> >
> > Yes, absolutely agreed. Abstracting away those kinds of hardware
> > differences is the whole *point* of having an OS/driver model. I.e.,
> > it's what the kernel is there for! If people want to bypass that and get
> > direct access to the hardware, they can already do that by using DPDK.
> >
> > So in other words, 100% agreed that we should not expect the BPF
> > developers to deal with hardware details as would be required with a
> > kptr-based interface.
> >
> > As for the kfunc-based interface, I think it shows some promise.
> > Exposing a list of function names to retrieve individual metadata items
> > instead of a struct layout is sorta comparable in terms of developer UI
> > accessibility etc (IMO).
>
> Looks like there are quite some use cases for hw_timestamp.
> Do you think we could add it to the uapi like struct xdp_md?
>
> The following is the current xdp_md:
> struct xdp_md {
>          __u32 data;
>          __u32 data_end;
>          __u32 data_meta;
>          /* Below access go through struct xdp_rxq_info */
>          __u32 ingress_ifindex; /* rxq->dev->ifindex */
>          __u32 rx_queue_index;  /* rxq->queue_index  */
>
>          __u32 egress_ifindex;  /* txq->dev->ifindex */
> };
>
> We could add  __u64 hw_timestamp to the xdp_md so user
> can just do xdp_md->hw_timestamp to get the value.
> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> available.
>
> Inside the kernel, the ctx rewriter can generate code
> to call driver specific function to retrieve the data.

If the driver generates the code to retrieve the data, how's that
different from the kfunc approach?
The only difference I see is that it would be a stronger UAPI than
the kfuncs?

> The kfunc approach can be used to *less* common use cases?

What's the advantage of having two approaches when one can cover
common and uncommon cases?

> > There are three main drawbacks, AFAICT:
> >
> > 1. It requires driver developers to write and maintain the code that
> > generates the unrolled BPF bytecode to access the metadata fields, which
> > is a non-trivial amount of complexity. Maybe this can be abstracted away
> > with some internal helpers though (like, e.g., a
> > bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> > the required JMP/MOV/LDX instructions?
> >
> > 2. AF_XDP programs won't be able to access the metadata without using a
> > custom XDP program that calls the kfuncs and puts the data into the
> > metadata area. We could solve this with some code in libxdp, though; if
> > this code can be made generic enough (so it just dumps the available
> > metadata functions from the running kernel at load time), it may be
> > possible to make it generic enough that it will be forward-compatible
> > with new versions of the kernel that add new fields, which should
> > alleviate Florian's concern about keeping things in sync.
> >
> > 3. It will make it harder to consume the metadata when building SKBs. I
> > think the CPUMAP and veth use cases are also quite important, and that
> > we want metadata to be available for building SKBs in this path. Maybe
> > this can be resolved by having a convenient kfunc for this that can be
> > used for programs doing such redirects. E.g., you could just call
> > xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> > would recursively expand into all the kfunc calls needed to extract the
> > metadata supported by the SKB path?
> >
> > -Toke
> >
Yonghong Song Oct. 31, 2022, 10:38 p.m. UTC | #12
On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> I was closely following this discussion for some time now. Seems we
>>>> reached the point where it's getting interesting for me.
>>>>
>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>>>> differences if the user space code has to handle all of it.
>>>>>>
>>>>>> I don't see how its any harder in practice though?
>>>>>
>>>>> You need to find out what HW/FW/config you're running, right?
>>>>> And all you have is a pointer to a blob of unknown type.
>>>>>
>>>>> Take timestamps for example, some NICs support adjusting the PHC
>>>>> or doing SW corrections (with different versions of hw/fw/server
>>>>> platforms being capable of both/one/neither).
>>>>>
>>>>> Sure you can extract all this info with tracing and careful
>>>>> inspection via uAPI. But I don't think that's _easier_.
>>>>> And the vendors can't run the results thru their validation
>>>>> (for whatever that's worth).
>>>>>
>>>>>>> I've had the same concern:
>>>>>>>
>>>>>>> Until we have some userspace library that abstracts all these details,
>>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>>>> of data and I need to go through the code and see what particular type
>>>>>>> it represents for my particular device and how the data I need is
>>>>>>> represented there. There are also these "if this is device v1 -> use
>>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>>>> this burden on the driver developers, but I agree that the drawback
>>>>>>> here is that we actually have to wait for the implementations to catch
>>>>>>> up.
>>>>>>
>>>>>> I agree with everything there, you will get a blob of data and then
>>>>>> will need to know what field you want to read using BTF. But, we
>>>>>> already do this for BPF programs all over the place so its not a big
>>>>>> lift for us. All other BPF tracing/observability requires the same
>>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>>>> place left to write BPF programs without thinking about BTF and
>>>>>> kernel data structures.
>>>>>>
>>>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>>>   From my point of view of supporting Cilium its a win and much preferred
>>>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>>>> and so on.
>>>>>>
>>>>>> If vendor updates firmware with new fields I get those immediately.
>>>>>
>>>>> Conversely it's a valid concern that those who *do* actually update
>>>>> their kernel regularly will have more things to worry about.
>>>>>
>>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>>>> programs might have to take a lot of other state into consideration
>>>>>>> when parsing the descriptors; all those details do seem like they
>>>>>>> belong to the driver code.
>>>>>>
>>>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>>>> be involved. With just a kptr I can support the device and any
>>>>>> firwmare versions without requiring help.
>>>>>
>>>>> 1) where are you getting all those HW / FW specs :S
>>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>>>
>>>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>>>> might push us closer to the agreement on the best way to proceed.
>>>>>>
>>>>>> I'll try to do a intel and mlx implementation to get a cross section.
>>>>>> I have a good collection of nics here so should be able to show a
>>>>>> couple firmware versions. It could be fine I think to have the raw
>>>>>> kptr access and then also kfuncs for some things perhaps.
>>>>>>
>>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>>>> vendors are supported and that's it.
>>>>>>
>>>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>>>> the opposite,
>>>>>
>>>>> TBH I'm only replying to the email because of the above part :)
>>>>> I thought this would be self evident, but I guess our perspectives
>>>>> are different.
>>>>>
>>>>> Perhaps you look at it from the perspective of SW running on someone
>>>>> else's cloud, an being able to move to another cloud, without having
>>>>> to worry if feature X is available in xdp or just skb.
>>>>>
>>>>> I look at it from the perspective of maintaining a cloud, with people
>>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>>>> (superior) startup, and cloud users are messing with raw descriptor -
>>>>> I'd need to go find every XDP program out there and make sure it
>>>>> understands the new descriptors.
>>>>
>>>> Here is another perspective:
>>>>
>>>> As AF_XDP application developer I don't wan't to deal with the
>>>> underlying hardware in detail. I like to request a feature from the OS
>>>> (in this case rx/tx timestamping). If the feature is available I will
>>>> simply use it, if not I might have to work around it - maybe by falling
>>>> back to SW timestamping.
>>>>
>>>> All parts of my application (BPF program included) should not be
>>>> optimized/adjusted for all the different HW variants out there.
>>>
>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>> it's what the kernel is there for! If people want to bypass that and get
>>> direct access to the hardware, they can already do that by using DPDK.
>>>
>>> So in other words, 100% agreed that we should not expect the BPF
>>> developers to deal with hardware details as would be required with a
>>> kptr-based interface.
>>>
>>> As for the kfunc-based interface, I think it shows some promise.
>>> Exposing a list of function names to retrieve individual metadata items
>>> instead of a struct layout is sorta comparable in terms of developer UI
>>> accessibility etc (IMO).
>>
>> Looks like there are quite some use cases for hw_timestamp.
>> Do you think we could add it to the uapi like struct xdp_md?
>>
>> The following is the current xdp_md:
>> struct xdp_md {
>>           __u32 data;
>>           __u32 data_end;
>>           __u32 data_meta;
>>           /* Below access go through struct xdp_rxq_info */
>>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>           __u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>           __u32 egress_ifindex;  /* txq->dev->ifindex */
>> };
>>
>> We could add  __u64 hw_timestamp to the xdp_md so user
>> can just do xdp_md->hw_timestamp to get the value.
>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>> available.
>>
>> Inside the kernel, the ctx rewriter can generate code
>> to call driver specific function to retrieve the data.
> 
> If the driver generates the code to retrieve the data, how's that
> different from the kfunc approach?
> The only difference I see is that it would be a more strong UAPI than
> the kfuncs?

Right. It is a strong uapi.

> 
>> The kfunc approach can be used to *less* common use cases?
> 
> What's the advantage of having two approaches when one can cover
> common and uncommon cases?

Beyond hw_timestamp, do we have any other fields ready to support?

If we end up with lots of fields to be accessed by the bpf program,
and the bpf program actually intends to access those fields, then
using a strong uapi might be a good thing, as it can make the code
much more streamlined.

> 
>>> There are three main drawbacks, AFAICT:
>>>
>>> 1. It requires driver developers to write and maintain the code that
>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>> with some internal helpers though (like, e.g., a
>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>> the required JMP/MOV/LDX instructions?
>>>
>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>> custom XDP program that calls the kfuncs and puts the data into the
>>> metadata area. We could solve this with some code in libxdp, though; if
>>> this code can be made generic enough (so it just dumps the available
>>> metadata functions from the running kernel at load time), it may be
>>> possible to make it generic enough that it will be forward-compatible
>>> with new versions of the kernel that add new fields, which should
>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>> think the CPUMAP and veth use cases are also quite important, and that
>>> we want metadata to be available for building SKBs in this path. Maybe
>>> this can be resolved by having a convenient kfunc for this that can be
>>> used for programs doing such redirects. E.g., you could just call
>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>> would recursively expand into all the kfunc calls needed to extract the
>>> metadata supported by the SKB path?
>>>
>>> -Toke
>>>
Stanislav Fomichev Oct. 31, 2022, 10:55 p.m. UTC | #13
On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
> >>
> >>
> >>
> >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I was closely following this discussion for some time now. Seems we
> >>>> reached the point where it's getting interesting for me.
> >>>>
> >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>>>> differences if the user space code has to handle all of it.
> >>>>>>
> >>>>>> I don't see how its any harder in practice though?
> >>>>>
> >>>>> You need to find out what HW/FW/config you're running, right?
> >>>>> And all you have is a pointer to a blob of unknown type.
> >>>>>
> >>>>> Take timestamps for example, some NICs support adjusting the PHC
> >>>>> or doing SW corrections (with different versions of hw/fw/server
> >>>>> platforms being capable of both/one/neither).
> >>>>>
> >>>>> Sure you can extract all this info with tracing and careful
> >>>>> inspection via uAPI. But I don't think that's _easier_.
> >>>>> And the vendors can't run the results thru their validation
> >>>>> (for whatever that's worth).
> >>>>>
> >>>>>>> I've had the same concern:
> >>>>>>>
> >>>>>>> Until we have some userspace library that abstracts all these details,
> >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>>>> of data and I need to go through the code and see what particular type
> >>>>>>> it represents for my particular device and how the data I need is
> >>>>>>> represented there. There are also these "if this is device v1 -> use
> >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>>>> this burden on the driver developers, but I agree that the drawback
> >>>>>>> here is that we actually have to wait for the implementations to catch
> >>>>>>> up.
> >>>>>>
> >>>>>> I agree with everything there, you will get a blob of data and then
> >>>>>> will need to know what field you want to read using BTF. But, we
> >>>>>> already do this for BPF programs all over the place so its not a big
> >>>>>> lift for us. All other BPF tracing/observability requires the same
> >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>>>> place left to write BPF programs without thinking about BTF and
> >>>>>> kernel data structures.
> >>>>>>
> >>>>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>>>   From my point of view of supporting Cilium its a win and much preferred
> >>>>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>>>> and so on.
> >>>>>>
> >>>>>> If vendor updates firmware with new fields I get those immediately.
> >>>>>
> >>>>> Conversely it's a valid concern that those who *do* actually update
> >>>>> their kernel regularly will have more things to worry about.
> >>>>>
> >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>>>> programs might have to take a lot of other state into consideration
> >>>>>>> when parsing the descriptors; all those details do seem like they
> >>>>>>> belong to the driver code.
> >>>>>>
> >>>>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>>>> be involved. With just a kptr I can support the device and any
> >>>>>> firwmare versions without requiring help.
> >>>>>
> >>>>> 1) where are you getting all those HW / FW specs :S
> >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>>>
> >>>>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>>>> might push us closer to the agreement on the best way to proceed.
> >>>>>>
> >>>>>> I'll try to do a intel and mlx implementation to get a cross section.
> >>>>>> I have a good collection of nics here so should be able to show a
> >>>>>> couple firmware versions. It could be fine I think to have the raw
> >>>>>> kptr access and then also kfuncs for some things perhaps.
> >>>>>>
> >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>>>> vendors are supported and that's it.
> >>>>>>
> >>>>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>>>> the opposite,
> >>>>>
> >>>>> TBH I'm only replying to the email because of the above part :)
> >>>>> I thought this would be self evident, but I guess our perspectives
> >>>>> are different.
> >>>>>
> >>>>> Perhaps you look at it from the perspective of SW running on someone
> >>>>> else's cloud, an being able to move to another cloud, without having
> >>>>> to worry if feature X is available in xdp or just skb.
> >>>>>
> >>>>> I look at it from the perspective of maintaining a cloud, with people
> >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>>>> (superior) startup, and cloud users are messing with raw descriptor -
> >>>>> I'd need to go find every XDP program out there and make sure it
> >>>>> understands the new descriptors.
> >>>>
> >>>> Here is another perspective:
> >>>>
> >>>> As AF_XDP application developer I don't wan't to deal with the
> >>>> underlying hardware in detail. I like to request a feature from the OS
> >>>> (in this case rx/tx timestamping). If the feature is available I will
> >>>> simply use it, if not I might have to work around it - maybe by falling
> >>>> back to SW timestamping.
> >>>>
> >>>> All parts of my application (BPF program included) should not be
> >>>> optimized/adjusted for all the different HW variants out there.
> >>>
> >>> Yes, absolutely agreed. Abstracting away those kinds of hardware
> >>> differences is the whole *point* of having an OS/driver model. I.e.,
> >>> it's what the kernel is there for! If people want to bypass that and get
> >>> direct access to the hardware, they can already do that by using DPDK.
> >>>
> >>> So in other words, 100% agreed that we should not expect the BPF
> >>> developers to deal with hardware details as would be required with a
> >>> kptr-based interface.
> >>>
> >>> As for the kfunc-based interface, I think it shows some promise.
> >>> Exposing a list of function names to retrieve individual metadata items
> >>> instead of a struct layout is sorta comparable in terms of developer UI
> >>> accessibility etc (IMO).
> >>
> >> Looks like there are quite some use cases for hw_timestamp.
> >> Do you think we could add it to the uapi like struct xdp_md?
> >>
> >> The following is the current xdp_md:
> >> struct xdp_md {
> >>           __u32 data;
> >>           __u32 data_end;
> >>           __u32 data_meta;
> >>           /* Below access go through struct xdp_rxq_info */
> >>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
> >>           __u32 rx_queue_index;  /* rxq->queue_index  */
> >>
> >>           __u32 egress_ifindex;  /* txq->dev->ifindex */
> >> };
> >>
> >> We could add  __u64 hw_timestamp to the xdp_md so user
> >> can just do xdp_md->hw_timestamp to get the value.
> >> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> >> available.
> >>
> >> Inside the kernel, the ctx rewriter can generate code
> >> to call driver specific function to retrieve the data.
> >
> > If the driver generates the code to retrieve the data, how's that
> > different from the kfunc approach?
> > The only difference I see is that it would be a more strong UAPI than
> > the kfuncs?
>
> Right. it is a strong uapi.
>
> >
> >> The kfunc approach can be used to *less* common use cases?
> >
> > What's the advantage of having two approaches when one can cover
> > common and uncommon cases?
>
> Beyond hw_timestamp, do we have any other fields ready to support?
>
> If it ends up with lots of fields to be accessed by the bpf program,
> and bpf program actually intends to access these fields,
> using a strong uapi might be a good thing as it can make code
> much streamlined.

There are a bunch. Alexander's series has a good list:

https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8

We can definitely call some of them more "common" than others, but
I'm not sure how strong a definition that would be.

> >
> >>> There are three main drawbacks, AFAICT:
> >>>
> >>> 1. It requires driver developers to write and maintain the code that
> >>> generates the unrolled BPF bytecode to access the metadata fields, which
> >>> is a non-trivial amount of complexity. Maybe this can be abstracted away
> >>> with some internal helpers though (like, e.g., a
> >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> >>> the required JMP/MOV/LDX instructions?
> >>>
> >>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>> custom XDP program that calls the kfuncs and puts the data into the
> >>> metadata area. We could solve this with some code in libxdp, though; if
> >>> this code can be made generic enough (so it just dumps the available
> >>> metadata functions from the running kernel at load time), it may be
> >>> possible to make it generic enough that it will be forward-compatible
> >>> with new versions of the kernel that add new fields, which should
> >>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>> think the CPUMAP and veth use cases are also quite important, and that
> >>> we want metadata to be available for building SKBs in this path. Maybe
> >>> this can be resolved by having a convenient kfunc for this that can be
> >>> used for programs doing such redirects. E.g., you could just call
> >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>> would recursively expand into all the kfunc calls needed to extract the
> >>> metadata supported by the SKB path?
> >>>
> >>> -Toke
> >>>
Martin KaFai Lau Oct. 31, 2022, 10:57 p.m. UTC | #14
On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>> 2. AF_XDP programs won't be able to access the metadata without using a
>> custom XDP program that calls the kfuncs and puts the data into the
>> metadata area. We could solve this with some code in libxdp, though; if
>> this code can be made generic enough (so it just dumps the available
>> metadata functions from the running kernel at load time), it may be
>> possible to make it generic enough that it will be forward-compatible
>> with new versions of the kernel that add new fields, which should
>> alleviate Florian's concern about keeping things in sync.
> 
> Good point. I had to convert to a custom program to use the kfuncs :-(
> But your suggestion sounds good; maybe libxdp can accept some extra
> info about at which offset the user would like to place the metadata
> and the library can generate the required bytecode?
> 
>> 3. It will make it harder to consume the metadata when building SKBs. I
>> think the CPUMAP and veth use cases are also quite important, and that
>> we want metadata to be available for building SKBs in this path. Maybe
>> this can be resolved by having a convenient kfunc for this that can be
>> used for programs doing such redirects. E.g., you could just call
>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>> would recursively expand into all the kfunc calls needed to extract the
>> metadata supported by the SKB path?
> 
> So this xdp_copy_metadata_for_skb will create a metadata layout that

Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
Not sure where the best point to specify this prog is, though. Somehow
during bpf_xdp_redirect_map?
Or does this prog belong to the target cpumap, so that the xdp prog
redirecting to this cpumap has to write the meta layout in a way the
cpumap is expecting?
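
(For what it's worth, a cpumap entry can already carry an XDP program
today, which would be one natural attach point. A userspace sketch;
prog_fd, cpu_map_fd and cpu are assumed to be set up elsewhere:)

/* userspace sketch (libbpf): BPF_MAP_TYPE_CPUMAP values can embed a
 * prog fd since v5.9; that prog runs on frames arriving at the cpumap
 * CPU and would be one place to hang a metadata-layout program
 */
struct bpf_cpumap_val val = {
	.qsize = 192,
	.bpf_prog.fd = prog_fd,
};

bpf_map_update_elem(cpu_map_fd, &cpu, &val, 0);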


> the kernel will be able to understand when converting back to skb?
> IIUC, the xdp program will look something like the following:
> 
> if (xdp packet is to be consumed by af_xdp) {
>    // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> own metadata layout
>    return bpf_redirect_map(xsk, ...);
> } else {
>    // if the packet is to be consumed by the kernel
>    xdp_copy_metadata_for_skb(ctx);
>    return bpf_redirect(...);
> }
> 
> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> put some magic number in the first byte(s) of the metadata so the
> kernel can check whether xdp_copy_metadata_for_skb has been called
> previously (or maybe xdp_frame can carry this extra signal, idk).
Stanislav Fomichev Nov. 1, 2022, 1:59 a.m. UTC | #15
On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >> 2. AF_XDP programs won't be able to access the metadata without using a
> >> custom XDP program that calls the kfuncs and puts the data into the
> >> metadata area. We could solve this with some code in libxdp, though; if
> >> this code can be made generic enough (so it just dumps the available
> >> metadata functions from the running kernel at load time), it may be
> >> possible to make it generic enough that it will be forward-compatible
> >> with new versions of the kernel that add new fields, which should
> >> alleviate Florian's concern about keeping things in sync.
> >
> > Good point. I had to convert to a custom program to use the kfuncs :-(
> > But your suggestion sounds good; maybe libxdp can accept some extra
> > info about at which offset the user would like to place the metadata
> > and the library can generate the required bytecode?
> >
> >> 3. It will make it harder to consume the metadata when building SKBs. I
> >> think the CPUMAP and veth use cases are also quite important, and that
> >> we want metadata to be available for building SKBs in this path. Maybe
> >> this can be resolved by having a convenient kfunc for this that can be
> >> used for programs doing such redirects. E.g., you could just call
> >> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >> would recursively expand into all the kfunc calls needed to extract the
> >> metadata supported by the SKB path?
> >
> > So this xdp_copy_metadata_for_skb will create a metadata layout that
>
> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> Not sure where is the best point to specify this prog though.  Somehow during
> bpf_xdp_redirect_map?
> or this prog belongs to the target cpumap and the xdp prog redirecting to this
> cpumap has to write the meta layout in a way that the cpumap is expecting?

We're probably interested in triggering it from the places where xdp
frames can eventually be converted into skbs?
So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
anything that's not XDP_DROP / AF_XDP redirect.)
We can probably make it magically work, and generate kernel-digestible
metadata whenever data == data_meta, but the question is: should we?
(We need to make sure we won't regress any existing cases that don't
rely on the metadata.)

> > the kernel will be able to understand when converting back to skb?
> > IIUC, the xdp program will look something like the following:
> >
> > if (xdp packet is to be consumed by af_xdp) {
> >    // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> > own metadata layout
> >    return bpf_redirect_map(xsk, ...);
> > } else {
> >    // if the packet is to be consumed by the kernel
> >    xdp_copy_metadata_for_skb(ctx);
> >    return bpf_redirect(...);
> > }
> >
> > Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> > put some magic number in the first byte(s) of the metadata so the
> > kernel can check whether xdp_copy_metadata_for_skb has been called
> > previously (or maybe xdp_frame can carry this extra signal, idk).
Toke Høiland-Jørgensen Nov. 1, 2022, 12:52 p.m. UTC | #16
Stanislav Fomichev <sdf@google.com> writes:

> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>> >> 2. AF_XDP programs won't be able to access the metadata without using a
>> >> custom XDP program that calls the kfuncs and puts the data into the
>> >> metadata area. We could solve this with some code in libxdp, though; if
>> >> this code can be made generic enough (so it just dumps the available
>> >> metadata functions from the running kernel at load time), it may be
>> >> possible to make it generic enough that it will be forward-compatible
>> >> with new versions of the kernel that add new fields, which should
>> >> alleviate Florian's concern about keeping things in sync.
>> >
>> > Good point. I had to convert to a custom program to use the kfuncs :-(
>> > But your suggestion sounds good; maybe libxdp can accept some extra
>> > info about at which offset the user would like to place the metadata
>> > and the library can generate the required bytecode?
>> >
>> >> 3. It will make it harder to consume the metadata when building SKBs. I
>> >> think the CPUMAP and veth use cases are also quite important, and that
>> >> we want metadata to be available for building SKBs in this path. Maybe
>> >> this can be resolved by having a convenient kfunc for this that can be
>> >> used for programs doing such redirects. E.g., you could just call
>> >> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>> >> would recursively expand into all the kfunc calls needed to extract the
>> >> metadata supported by the SKB path?
>> >
>> > So this xdp_copy_metadata_for_skb will create a metadata layout that
>>
>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>> Not sure where is the best point to specify this prog though.  Somehow during
>> bpf_xdp_redirect_map?
>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>
> We're probably interested in triggering it from the places where xdp
> frames can eventually be converted into skbs?
> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> anything that's not XDP_DROP / AF_XDP redirect).
> We can probably make it magically work, and can generate
> kernel-digestible metadata whenever data == data_meta, but the
> question - should we?
> (need to make sure we won't regress any existing cases that are not
> relying on the metadata)

So I was thinking about whether we could have the kernel do this
automatically, and concluded that this was probably not feasible in
general, which is why I suggested the explicit helper. My reasoning was
as follows:

For straight XDP_PASS in the driver we don't actually need to do
anything today, as the driver itself will build the SKB and read any
metadata it needs from the HW descriptor[0].

This leaves packets that are redirected (either to a veth or a cpumap so
we build SKBs from them later); here the problem is that we buffer the
packets (for performance reasons) so that the redirect doesn't actually
happen until after the driver exits the NAPI loop. At which point we
don't have access to the HW descriptors anymore, so we can't actually
read the metadata.

This means that if we want to execute the metadata gathering
automatically, we'd have to do it in xdp_do_redirect(). Which means that
we'll have to figure out, at that point, whether the XDP frame is likely
to be converted to an SKB. This will add at least one branch (and
probably more) that will be in-path for every redirected frame.

Hence, making it up to the XDP program itself to decide whether it will
need the metadata for SKB conversion seems like a better choice, as long
as we make it easy for the XDP program to do this. Instead of a helper,
this could also simply be a new flag to the bpf_redirect{,_map}()
helpers (either opt-in or opt-out depending on the overhead), which
would be even simpler?

I.e.,

return bpf_redirect_map(&cpumap, 0, BPF_F_PREPARE_SKB_METADATA);

-Toke


[0] As an aside, in the future drivers may want to take advantage of the
XDP-specific metadata reading also when building SKBs (so it doesn't
have to implement it in both BPF and C code). For this, we could expose
a new internal helper function that the drivers could call to simply
execute the XDP-to-skb metadata helpers the same way the stack/helper
does.
David Ahern Nov. 1, 2022, 1:43 p.m. UTC | #17
On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>
>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>> info about at which offset the user would like to place the metadata
>>>> and the library can generate the required bytecode?
>>>>
>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>> metadata supported by the SKB path?
>>>>
>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>
>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>> Not sure where is the best point to specify this prog though.  Somehow during
>>> bpf_xdp_redirect_map?
>>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>>
>> We're probably interested in triggering it from the places where xdp
>> frames can eventually be converted into skbs?
>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>> anything that's not XDP_DROP / AF_XDP redirect).
>> We can probably make it magically work, and can generate
>> kernel-digestible metadata whenever data == data_meta, but the
>> question - should we?
>> (need to make sure we won't regress any existing cases that are not
>> relying on the metadata)
> 
> So I was thinking about whether we could have the kernel do this
> automatically, and concluded that this was probably not feasible in
> general, which is why I suggested the explicit helper. My reasoning was
> as follows:
> 
> For straight XDP_PASS in the driver we don't actually need to do
> anything today, as the driver itself will build the SKB and read any
> metadata it needs from the HW descriptor[0].

The program can pop encap headers, MPLS tags, ... and thus affect the
metadata in the descriptor (besides the timestamp).

> 
> This leaves packets that are redirected (either to a veth or a cpumap so
> we build SKBs from them later); here the problem is that we buffer the
> packets (for performance reasons) so that the redirect doesn't actually
> happen until after the driver exits the NAPI loop. At which point we
> don't have access to the HW descriptors anymore, so we can't actually
> read the metadata.
> 
> This means that if we want to execute the metadata gathering
> automatically, we'd have to do it in xdp_do_redirect(). Which means that
> we'll have to figure out, at that point, whether the XDP frame is likely
> to be converted to an SKB. This will add at least one branch (and
> probably more) that will be in-path for every redirected frame.

Or the frame is forwarded to a tun device as an xdp frame, wanting to
pass metadata into a VM which may construct an skb in the guest. This
case is arguably aligned with the redirect from vendor1 to vendor2.

This thread (and others) seem to be focused on the Rx path, but the Tx
path is equally important with similar needs.
Toke Høiland-Jørgensen Nov. 1, 2022, 2:20 p.m. UTC | #18
David Ahern <dsahern@gmail.com> writes:

> On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>> with new versions of the kernel that add new fields, which should
>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>> info about at which offset the user would like to place the metadata
>>>>> and the library can generate the required bytecode?
>>>>>
>>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>>> metadata supported by the SKB path?
>>>>>
>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>
>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>> Not sure where is the best point to specify this prog though.  Somehow during
>>>> bpf_xdp_redirect_map?
>>>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>>>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>>>
>>> We're probably interested in triggering it from the places where xdp
>>> frames can eventually be converted into skbs?
>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>> anything that's not XDP_DROP / AF_XDP redirect).
>>> We can probably make it magically work, and can generate
>>> kernel-digestible metadata whenever data == data_meta, but the
>>> question - should we?
>>> (need to make sure we won't regress any existing cases that are not
>>> relying on the metadata)
>> 
>> So I was thinking about whether we could have the kernel do this
>> automatically, and concluded that this was probably not feasible in
>> general, which is why I suggested the explicit helper. My reasoning was
>> as follows:
>> 
>> For straight XDP_PASS in the driver we don't actually need to do
>> anything today, as the driver itself will build the SKB and read any
>> metadata it needs from the HW descriptor[0].
>
> The program can pop encap headers, mpls tags, ... and thus affect the
> metadata in the descriptor (besides the timestamp).

Hmm, right, good point. How does XDP_PASS deal with that today, though?

I guess this is an argument for making the "read HW metadata into SKB
format" thing be a kfunc/helper rather than a flag to bpf_redirect(),
then. Because then we can allow the XDP program to override/modify the
metadata afterwards, either by defining it as:

int xdp_copy_metadata_for_skb(struct xdp_md *ctx, struct xdp_skb_meta *override, int flags)

where the XDP program can fill in 'override' with new data that takes
precedence over the stuff from the HW (like a modified checksum or
offset or something).

Or we can just have xdp_copy_metadata_for_skb() write into the regular
XDP metadata area, and let the XDP program modify it afterwards. I feel
like the override argument would be easier to use, though.
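
A minimal usage sketch of that first variant (everything here is
hypothetical: struct xdp_skb_meta, the kfunc and the flags are just the
proposal above spelled out):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* all names below are sketches, not an existing API */
struct xdp_skb_meta {
	__u16 csum_start;
	__u16 csum_offset;
};
extern int xdp_copy_metadata_for_skb(struct xdp_md *ctx,
				     struct xdp_skb_meta *override,
				     int flags) __ksym;

SEC("xdp")
int xdp_fwd_prog(struct xdp_md *ctx)
{
	struct xdp_skb_meta override = {
		.csum_start = 14,	/* e.g. recomputed after popping a tag */
		.csum_offset = 16,
	};

	xdp_copy_metadata_for_skb(ctx, &override, 0);
	return bpf_redirect(1, 0);	/* ifindex 1, for illustration */
}

char _license[] SEC("license") = "GPL";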

Also, having it be completely opaque *where* the metadata is stored when
using xdp_copy_metadata_for_skb() lets us be more flexible about it.
E.g., the helper could write the timestamp directly into
skb_shared_info, instead of stuffing it into the metadata area where it
then has to be copied out later.

>> This leaves packets that are redirected (either to a veth or a cpumap so
>> we build SKBs from them later); here the problem is that we buffer the
>> packets (for performance reasons) so that the redirect doesn't actually
>> happen until after the driver exits the NAPI loop. At which point we
>> don't have access to the HW descriptors anymore, so we can't actually
>> read the metadata.
>> 
>> This means that if we want to execute the metadata gathering
>> automatically, we'd have to do it in xdp_do_redirect(). Which means that
>> we'll have to figure out, at that point, whether the XDP frame is likely
>> to be converted to an SKB. This will add at least one branch (and
>> probably more) that will be in-path for every redirected frame.
>
> or forwarded to a tun device as an xdp frame and wanting to pass
> metadata into a VM which may construct an skb in the guest. This case is
> arguably aligned with the redirect from vendor1 to vendor2.
>
> This thread (and others) seem to be focused on the Rx path, but the Tx
> path is equally important with similar needs.

You're right, of course. Thinking a bit out loud here, but I actually
think the kfunc approach makes the TX side easier:

We already have the ability to execute a second "TX" XDP program inside
the devmaps. At which point that program is also tied to a particular
interface. So we could duplicate the RX-side kfunc trick, and expose a
set of *writer* kfuncs for metadata. So that an XDP program in the
devmap can simply do:

if (bpf_xdp_metadata_tx_timestamp_supported())
  bpf_xdp_metadata_tx_timestamp(ctx, tsval);

and those two kfuncs will be unrolled by the TX-side driver as well to
store them wherever they need to go to reach the wire.
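
Fleshed out, such a devmap-attached program might look like this (both
kfuncs are hypothetical, the SEC name follows recent libbpf conventions,
and the timestamp value is just a stand-in):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical writer kfuncs, to be unrolled by the TX-side driver */
extern int bpf_xdp_metadata_tx_timestamp_supported(void) __ksym;
extern void bpf_xdp_metadata_tx_timestamp(struct xdp_md *ctx, __u64 tsval) __ksym;

SEC("xdp/devmap")
int xdp_tx_meta(struct xdp_md *ctx)
{
	__u64 tsval = bpf_ktime_get_ns();	/* stand-in for a real tsval */

	if (bpf_xdp_metadata_tx_timestamp_supported())
		bpf_xdp_metadata_tx_timestamp(ctx, tsval);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";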

The one complication here being, of course, that by the time the devmap
XDP program is executed, the driver hasn't seen the frame at all, yet,
so it doesn't have anywhere to store that data. We'd need to reuse the
frame metadata area for this (with some flag indicating that it's
valid), or we'd need a new area the driver could use as scratch space
specific to the xdp_frame (like the skb->cb field, I suppose).

-Toke
Jesper Dangaard Brouer Nov. 1, 2022, 2:23 p.m. UTC | #19
On 31/10/2022 23.55, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote:
>>
>> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
>>> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>>>
>>>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>>>> "Bezdeka, Florian"<florian.bezdeka@siemens.com>  writes:
>>>>>>
>>>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
[...]
>>>>>> All parts of my application (BPF program included) should not be
>>>>>> optimized/adjusted for all the different HW variants out there.
>>>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>>>> it's what the kernel is there for! If people want to bypass that and get
>>>>> direct access to the hardware, they can already do that by using DPDK.
>>>>>
>>>>> So in other words, 100% agreed that we should not expect the BPF
>>>>> developers to deal with hardware details as would be required with a
>>>>> kptr-based interface.
>>>>>
>>>>> As for the kfunc-based interface, I think it shows some promise.
>>>>> Exposing a list of function names to retrieve individual metadata items
>>>>> instead of a struct layout is sorta comparable in terms of developer UI
>>>>> accessibility etc (IMO).
>>>>
>>>> Looks like there are quite some use cases for hw_timestamp.
>>>> Do you think we could add it to the uapi like struct xdp_md?
>>>>
>>>> The following is the current xdp_md:
>>>> struct xdp_md {
>>>>            __u32 data;
>>>>            __u32 data_end;
>>>>            __u32 data_meta;
>>>>            /* Below access go through struct xdp_rxq_info */
>>>>            __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>>>            __u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>            __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> };
>>>>
>>>> We could add  __u64 hw_timestamp to the xdp_md so user
>>>> can just do xdp_md->hw_timestamp to get the value.
>>>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>>>> available.
>>>>
>>>> Inside the kernel, the ctx rewriter can generate code
>>>> to call driver specific function to retrieve the data.
>>> If the driver generates the code to retrieve the data, how's that
>>> different from the kfunc approach?
>>> The only difference I see is that it would be a more strong UAPI than
>>> the kfuncs?
>> Right. it is a strong uapi.
>>
>>>> The kfunc approach can be used to *less* common use cases?
>>> What's the advantage of having two approaches when one can cover
>>> common and uncommon cases?
>>
>> Beyond hw_timestamp, do we have any other fields ready to support?
>>
>> If it ends up with lots of fields to be accessed by the bpf program,
>> and bpf program actually intends to access these fields,
>> using a strong uapi might be a good thing as it can make code
>> much streamlined.
>
> There are a bunch. Alexander's series has a good list:
> 
> https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8
> 

Below are the fields I've identified, which are close to what Alexander 
also found.

  struct xdp_hints_common {
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	u16 rx_queue;
	u16 vlan_tci;
	u32 rx_hash32;
	u32 xdp_hints_flags;
	u64 btf_full_id; /* BTF object + type ID */
  } __attribute__((aligned(4))) __attribute__((packed));

Some of the fields are encoded via flags:

  enum xdp_hints_flags {
	HINT_FLAG_CSUM_TYPE_BIT0  = BIT(0),
	HINT_FLAG_CSUM_TYPE_BIT1  = BIT(1),
	HINT_FLAG_CSUM_TYPE_MASK  = 0x3,

	HINT_FLAG_CSUM_LEVEL_BIT0 = BIT(2),
	HINT_FLAG_CSUM_LEVEL_BIT1 = BIT(3),
	HINT_FLAG_CSUM_LEVEL_MASK = 0xC,
	HINT_FLAG_CSUM_LEVEL_SHIFT = 2,

	HINT_FLAG_RX_HASH_TYPE_BIT0 = BIT(4),
	HINT_FLAG_RX_HASH_TYPE_BIT1 = BIT(5),
	HINT_FLAG_RX_HASH_TYPE_MASK = 0x30,
	HINT_FLAG_RX_HASH_TYPE_SHIFT = 0x4,

	HINT_FLAG_RX_QUEUE = BIT(7),

	HINT_FLAG_VLAN_PRESENT            = BIT(8),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021Q  = BIT(9),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021AD = BIT(10),
	/* Flags from BIT(16) can be used by drivers */
  };
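
For reference, a consumer would decode the packed fields above roughly
like this (a sketch against the layout exactly as posted):

  static inline u32 xdp_hints_csum_level(u32 flags)
  {
	return (flags & HINT_FLAG_CSUM_LEVEL_MASK) >> HINT_FLAG_CSUM_LEVEL_SHIFT;
  }

  static inline u32 xdp_hints_rx_hash_type(u32 flags)
  {
	return (flags & HINT_FLAG_RX_HASH_TYPE_MASK) >> HINT_FLAG_RX_HASH_TYPE_SHIFT;
  }

  static inline bool xdp_hints_rx_queue_valid(u32 flags)
  {
	return flags & HINT_FLAG_RX_QUEUE;
  }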

> We can definitely call some of them more "common" than the others, but
> not sure how strong of a definition that would be.

The important fields that would be worth considering as UAPI candidates
are: (1) RX-hash, (2) Hash-type and (3) RX-checksum.
With these three we can avoid calling the flow-dissector, and GRO frame
aggregation works. (This currently hurts xdp_frame to SKB performance a
lot in practice.)

*BUT* in its current form above (incl. Alexander's approach/patch) it
would be a mistake to standardize the "(2) Hash-type" as UAPI in this
simplified "reduced" form (which is what the SKB "needs").

There is a huge untapped potential in the Hash-type.  Thanks to
Microsoft, almost all NIC hardware provides a Hash-type that gives us the
L3-protocol (IPv4 or IPv6) and the L4-protocol (UDP or TCP and sometimes
SCTP), plus info on whether extension headers are present. (Digging into
datasheets, we can often also get the header-size.)

Think about how many cycles an XDP BPF-prog can save on parsing protocol
headers.  I'm also hoping we can leverage this to allow SKBs created
from an xdp_frame to have skb->transport_header and skb->network_header
pre-populated (and skip some of these netstack layers).
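
To make that concrete: with a richer hash-type exposed, an XDP prog
could branch straight to its L4 handling without touching the headers.
A fragment-level sketch, where the kfunc and the type bits are purely
hypothetical:

  __u32 hash_type = 0;
  __u32 hash = bpf_xdp_metadata_rx_hash(ctx, &hash_type); /* hypothetical */

  if (hash_type & RX_HASH_TYPE_L4_TCP)	/* hypothetical type bit */
	return handle_tcp(ctx);		/* no header parsing needed */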

--Jesper

p.s. in my patchset, I exposed the "raw" Hash-type bits from the
descriptor in the hope this would evolve.
Martin KaFai Lau Nov. 1, 2022, 5:05 p.m. UTC | #20
On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>> this code can be made generic enough (so it just dumps the available
>>>> metadata functions from the running kernel at load time), it may be
>>>> possible to make it generic enough that it will be forward-compatible
>>>> with new versions of the kernel that add new fields, which should
>>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>> info about at which offset the user would like to place the metadata
>>> and the library can generate the required bytecode?
>>>
>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>> this can be resolved by having a convenient kfunc for this that can be
>>>> used for programs doing such redirects. E.g., you could just call
>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>> would recursively expand into all the kfunc calls needed to extract the
>>>> metadata supported by the SKB path?
>>>
>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>
>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>> Not sure where is the best point to specify this prog though.  Somehow during
>> bpf_xdp_redirect_map?
>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>> cpumap has to write the meta layout in a way that the cpumap is expecting?
> 
> We're probably interested in triggering it from the places where xdp
> frames can eventually be converted into skbs?
> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> anything that's not XDP_DROP / AF_XDP redirect).
> We can probably make it magically work, and can generate
> kernel-digestible metadata whenever data == data_meta, but the
> question - should we?
> (need to make sure we won't regress any existing cases that are not
> relying on the metadata)

Instead of having some kernel-digestible metadata, how about calling another
bpf prog to initialize the skb fields from the meta area after
__xdp_build_skb_from_frame() in the cpumap, so
run_xdp_set_skb_fields_from_metadata() may be a better name.

The xdp_prog@rx sets the metadata and then redirects.  If the xdp_prog@rx can
also specify an xdp prog to initialize the skb fields from the meta area, then
there is no need to have a kfunc to enforce a kernel-digestible layout.  Not
sure what is a good way to specify this xdp_prog though...

>>> the kernel will be able to understand when converting back to skb?
>>> IIUC, the xdp program will look something like the following:
>>>
>>> if (xdp packet is to be consumed by af_xdp) {
>>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
>>> own metadata layout
>>>     return bpf_redirect_map(xsk, ...);
>>> } else {
>>>     // if the packet is to be consumed by the kernel
>>>     xdp_copy_metadata_for_skb(ctx);
>>>     return bpf_redirect(...);
>>> }
>>>
>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
>>> put some magic number in the first byte(s) of the metadata so the
>>> kernel can check whether xdp_copy_metadata_for_skb has been called
>>> previously (or maybe xdp_frame can carry this extra signal, idk).
Martin KaFai Lau Nov. 1, 2022, 5:31 p.m. UTC | #21
On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> I was closely following this discussion for some time now. Seems we
>>>> reached the point where it's getting interesting for me.
>>>>
>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>>>> differences if the user space code has to handle all of it.
>>>>>>
>>>>>> I don't see how its any harder in practice though?
>>>>>
>>>>> You need to find out what HW/FW/config you're running, right?
>>>>> And all you have is a pointer to a blob of unknown type.
>>>>>
>>>>> Take timestamps for example, some NICs support adjusting the PHC
>>>>> or doing SW corrections (with different versions of hw/fw/server
>>>>> platforms being capable of both/one/neither).
>>>>>
>>>>> Sure you can extract all this info with tracing and careful
>>>>> inspection via uAPI. But I don't think that's _easier_.
>>>>> And the vendors can't run the results thru their validation
>>>>> (for whatever that's worth).
>>>>>
>>>>>>> I've had the same concern:
>>>>>>>
>>>>>>> Until we have some userspace library that abstracts all these details,
>>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>>>> of data and I need to go through the code and see what particular type
>>>>>>> it represents for my particular device and how the data I need is
>>>>>>> represented there. There are also these "if this is device v1 -> use
>>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>>>> this burden on the driver developers, but I agree that the drawback
>>>>>>> here is that we actually have to wait for the implementations to catch
>>>>>>> up.
>>>>>>
>>>>>> I agree with everything there, you will get a blob of data and then
>>>>>> will need to know what field you want to read using BTF. But, we
>>>>>> already do this for BPF programs all over the place so its not a big
>>>>>> lift for us. All other BPF tracing/observability requires the same
>>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>>>> place left to write BPF programs without thinking about BTF and
>>>>>> kernel data structures.
>>>>>>
>>>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>>>   From my point of view of supporting Cilium its a win and much preferred
>>>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>>>> and so on.
>>>>>>
>>>>>> If vendor updates firmware with new fields I get those immediately.
>>>>>
>>>>> Conversely it's a valid concern that those who *do* actually update
>>>>> their kernel regularly will have more things to worry about.
>>>>>
>>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>>>> programs might have to take a lot of other state into consideration
>>>>>>> when parsing the descriptors; all those details do seem like they
>>>>>>> belong to the driver code.
>>>>>>
>>>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>>>> be involved. With just a kptr I can support the device and any
>>>>>> firwmare versions without requiring help.
>>>>>
>>>>> 1) where are you getting all those HW / FW specs :S
>>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>>>
>>>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>>>> might push us closer to the agreement on the best way to proceed.
>>>>>>
>>>>>> I'll try to do a intel and mlx implementation to get a cross section.
>>>>>> I have a good collection of nics here so should be able to show a
>>>>>> couple firmware versions. It could be fine I think to have the raw
>>>>>> kptr access and then also kfuncs for some things perhaps.
>>>>>>
>>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>>>> vendors are supported and that's it.
>>>>>>
>>>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>>>> the opposite,
>>>>>
>>>>> TBH I'm only replying to the email because of the above part :)
>>>>> I thought this would be self evident, but I guess our perspectives
>>>>> are different.
>>>>>
>>>>> Perhaps you look at it from the perspective of SW running on someone
>>>>> else's cloud, an being able to move to another cloud, without having
>>>>> to worry if feature X is available in xdp or just skb.
>>>>>
>>>>> I look at it from the perspective of maintaining a cloud, with people
>>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>>>> (superior) startup, and cloud users are messing with raw descriptor -
>>>>> I'd need to go find every XDP program out there and make sure it
>>>>> understands the new descriptors.
>>>>
>>>> Here is another perspective:
>>>>
>>>> As AF_XDP application developer I don't wan't to deal with the
>>>> underlying hardware in detail. I like to request a feature from the OS
>>>> (in this case rx/tx timestamping). If the feature is available I will
>>>> simply use it, if not I might have to work around it - maybe by falling
>>>> back to SW timestamping.
>>>>
>>>> All parts of my application (BPF program included) should not be
>>>> optimized/adjusted for all the different HW variants out there.
>>>
>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>> it's what the kernel is there for! If people want to bypass that and get
>>> direct access to the hardware, they can already do that by using DPDK.
>>>
>>> So in other words, 100% agreed that we should not expect the BPF
>>> developers to deal with hardware details as would be required with a
>>> kptr-based interface.
>>>
>>> As for the kfunc-based interface, I think it shows some promise.
>>> Exposing a list of function names to retrieve individual metadata items
>>> instead of a struct layout is sorta comparable in terms of developer UI
>>> accessibility etc (IMO).
>>
>> Looks like there are quite some use cases for hw_timestamp.
>> Do you think we could add it to the uapi like struct xdp_md?
>>
>> The following is the current xdp_md:
>> struct xdp_md {
>>           __u32 data;
>>           __u32 data_end;
>>           __u32 data_meta;
>>           /* Below access go through struct xdp_rxq_info */
>>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>           __u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>           __u32 egress_ifindex;  /* txq->dev->ifindex */
>> };
>>
>> We could add  __u64 hw_timestamp to the xdp_md so user
>> can just do xdp_md->hw_timestamp to get the value.
>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>> available.
>>
>> Inside the kernel, the ctx rewriter can generate code
>> to call driver specific function to retrieve the data.
> 
> If the driver generates the code to retrieve the data, how's that
> different from the kfunc approach?
> The only difference I see is that it would be a more strong UAPI than
> the kfuncs?

Another thing may be worth considering: some hints for some HW/driver may be
harder (or may not be worth it) to unroll/inline.  For example, I see a driver
doing spin_lock_bh while getting the hwtstamp.  For this case, keeping it as a
kfunc call and avoiding the unroll/inline may be the right thing to do.
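
E.g. a driver-side shape like this (hypothetical driver, purely for
illustration) is awkward to express as unrolled bytecode but trivial to
keep as a real call:

#include <linux/spinlock.h>
#include <linux/types.h>

struct example_rx_ring {
	spinlock_t hwtstamp_lock;
	u64 last_hwtstamp;
};

/* stays a genuine kfunc call instead of being unrolled into the prog */
u64 example_xdp_rx_hwtstamp(struct example_rx_ring *ring)
{
	u64 ts;

	spin_lock_bh(&ring->hwtstamp_lock);
	ts = ring->last_hwtstamp;
	spin_unlock_bh(&ring->hwtstamp_lock);

	return ts;
}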

> 
>> The kfunc approach can be used to *less* common use cases?
> 
> What's the advantage of having two approaches when one can cover
> common and uncommon cases?
> 
>>> There are three main drawbacks, AFAICT:
>>>
>>> 1. It requires driver developers to write and maintain the code that
>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>> with some internal helpers though (like, e.g., a
>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>> the required JMP/MOV/LDX instructions?
>>>
>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>> custom XDP program that calls the kfuncs and puts the data into the
>>> metadata area. We could solve this with some code in libxdp, though; if
>>> this code can be made generic enough (so it just dumps the available
>>> metadata functions from the running kernel at load time), it may be
>>> possible to make it generic enough that it will be forward-compatible
>>> with new versions of the kernel that add new fields, which should
>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>> think the CPUMAP and veth use cases are also quite important, and that
>>> we want metadata to be available for building SKBs in this path. Maybe
>>> this can be resolved by having a convenient kfunc for this that can be
>>> used for programs doing such redirects. E.g., you could just call
>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>> would recursively expand into all the kfunc calls needed to extract the
>>> metadata supported by the SKB path?
>>>
>>> -Toke
>>>
Stanislav Fomichev Nov. 1, 2022, 8:12 p.m. UTC | #22
On Tue, Nov 1, 2022 at 10:31 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
> >>
> >>
> >>
> >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I was closely following this discussion for some time now. Seems we
> >>>> reached the point where it's getting interesting for me.
> >>>>
> >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>>>> differences if the user space code has to handle all of it.
> >>>>>>
> >>>>>> I don't see how its any harder in practice though?
> >>>>>
> >>>>> You need to find out what HW/FW/config you're running, right?
> >>>>> And all you have is a pointer to a blob of unknown type.
> >>>>>
> >>>>> Take timestamps for example, some NICs support adjusting the PHC
> >>>>> or doing SW corrections (with different versions of hw/fw/server
> >>>>> platforms being capable of both/one/neither).
> >>>>>
> >>>>> Sure you can extract all this info with tracing and careful
> >>>>> inspection via uAPI. But I don't think that's _easier_.
> >>>>> And the vendors can't run the results thru their validation
> >>>>> (for whatever that's worth).
> >>>>>
> >>>>>>> I've had the same concern:
> >>>>>>>
> >>>>>>> Until we have some userspace library that abstracts all these details,
> >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>>>> of data and I need to go through the code and see what particular type
> >>>>>>> it represents for my particular device and how the data I need is
> >>>>>>> represented there. There are also these "if this is device v1 -> use
> >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>>>> this burden on the driver developers, but I agree that the drawback
> >>>>>>> here is that we actually have to wait for the implementations to catch
> >>>>>>> up.
> >>>>>>
> >>>>>> I agree with everything there, you will get a blob of data and then
> >>>>>> will need to know what field you want to read using BTF. But, we
> >>>>>> already do this for BPF programs all over the place so its not a big
> >>>>>> lift for us. All other BPF tracing/observability requires the same
> >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>>>> place left to write BPF programs without thinking about BTF and
> >>>>>> kernel data structures.
> >>>>>>
> >>>>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>>>   From my point of view of supporting Cilium its a win and much preferred
> >>>>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>>>> and so on.
> >>>>>>
> >>>>>> If vendor updates firmware with new fields I get those immediately.
> >>>>>
> >>>>> Conversely it's a valid concern that those who *do* actually update
> >>>>> their kernel regularly will have more things to worry about.
> >>>>>
> >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>>>> programs might have to take a lot of other state into consideration
> >>>>>>> when parsing the descriptors; all those details do seem like they
> >>>>>>> belong to the driver code.
> >>>>>>
> >>>>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>>>> be involved. With just a kptr I can support the device and any
> >>>>>> firwmare versions without requiring help.
> >>>>>
> >>>>> 1) where are you getting all those HW / FW specs :S
> >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>>>
> >>>>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>>>> might push us closer to the agreement on the best way to proceed.
> >>>>>>
> >>>>>> I'll try to do a intel and mlx implementation to get a cross section.
> >>>>>> I have a good collection of nics here so should be able to show a
> >>>>>> couple firmware versions. It could be fine I think to have the raw
> >>>>>> kptr access and then also kfuncs for some things perhaps.
> >>>>>>
> >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>>>> vendors are supported and that's it.
> >>>>>>
> >>>>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>>>> the opposite,
> >>>>>
> >>>>> TBH I'm only replying to the email because of the above part :)
> >>>>> I thought this would be self evident, but I guess our perspectives
> >>>>> are different.
> >>>>>
> >>>>> Perhaps you look at it from the perspective of SW running on someone
> >>>>> else's cloud, an being able to move to another cloud, without having
> >>>>> to worry if feature X is available in xdp or just skb.
> >>>>>
> >>>>> I look at it from the perspective of maintaining a cloud, with people
> >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>>>> (superior) startup, and cloud users are messing with raw descriptor -
> >>>>> I'd need to go find every XDP program out there and make sure it
> >>>>> understands the new descriptors.
> >>>>
> >>>> Here is another perspective:
> >>>>
> >>>> As AF_XDP application developer I don't wan't to deal with the
> >>>> underlying hardware in detail. I like to request a feature from the OS
> >>>> (in this case rx/tx timestamping). If the feature is available I will
> >>>> simply use it, if not I might have to work around it - maybe by falling
> >>>> back to SW timestamping.
> >>>>
> >>>> All parts of my application (BPF program included) should not be
> >>>> optimized/adjusted for all the different HW variants out there.
> >>>
> >>> Yes, absolutely agreed. Abstracting away those kinds of hardware
> >>> differences is the whole *point* of having an OS/driver model. I.e.,
> >>> it's what the kernel is there for! If people want to bypass that and get
> >>> direct access to the hardware, they can already do that by using DPDK.
> >>>
> >>> So in other words, 100% agreed that we should not expect the BPF
> >>> developers to deal with hardware details as would be required with a
> >>> kptr-based interface.
> >>>
> >>> As for the kfunc-based interface, I think it shows some promise.
> >>> Exposing a list of function names to retrieve individual metadata items
> >>> instead of a struct layout is sorta comparable in terms of developer UI
> >>> accessibility etc (IMO).
> >>
> >> Looks like there are quite some use cases for hw_timestamp.
> >> Do you think we could add it to the uapi like struct xdp_md?
> >>
> >> The following is the current xdp_md:
> >> struct xdp_md {
> >>           __u32 data;
> >>           __u32 data_end;
> >>           __u32 data_meta;
> >>           /* Below access go through struct xdp_rxq_info */
> >>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
> >>           __u32 rx_queue_index;  /* rxq->queue_index  */
> >>
> >>           __u32 egress_ifindex;  /* txq->dev->ifindex */
> >> };
> >>
> >> We could add  __u64 hw_timestamp to the xdp_md so user
> >> can just do xdp_md->hw_timestamp to get the value.
> >> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> >> available.
> >>
> >> Inside the kernel, the ctx rewriter can generate code
> >> to call driver specific function to retrieve the data.
> >
> > If the driver generates the code to retrieve the data, how's that
> > different from the kfunc approach?
> > The only difference I see is that it would be a more strong UAPI than
> > the kfuncs?
>
> Another thing may be worth considering: some hints for some HW/driver may be
> harder (or may not be worth it) to unroll/inline.  For example, I see a driver
> doing spin_lock_bh while getting the hwtstamp.  For this case, keeping it as a
> kfunc call and avoiding the unroll/inline may be the right thing to do.

Yeah, I'm trying to look at the drivers right now and doing
spinlocks/seqlocks might complicate the story...
But it seems like in the worst case, the unrolled bytecode can always
resort to calling a kernel function?
(we might need to have some scratch area to preserve r1-r5 and we
can't touch r6-r9 because we are not in a real call, but seems doable;
I'll try to illustrate with a bunch of examples)


> >> The kfunc approach can be used to *less* common use cases?
> >
> > What's the advantage of having two approaches when one can cover
> > common and uncommon cases?
> >
> >>> There are three main drawbacks, AFAICT:
> >>>
> >>> 1. It requires driver developers to write and maintain the code that
> >>> generates the unrolled BPF bytecode to access the metadata fields, which
> >>> is a non-trivial amount of complexity. Maybe this can be abstracted away
> >>> with some internal helpers though (like, e.g., a
> >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> >>> the required JMP/MOV/LDX instructions?
> >>>
> >>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>> custom XDP program that calls the kfuncs and puts the data into the
> >>> metadata area. We could solve this with some code in libxdp, though; if
> >>> this code can be made generic enough (so it just dumps the available
> >>> metadata functions from the running kernel at load time), it may be
> >>> possible to make it generic enough that it will be forward-compatible
> >>> with new versions of the kernel that add new fields, which should
> >>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>> think the CPUMAP and veth use cases are also quite important, and that
> >>> we want metadata to be available for building SKBs in this path. Maybe
> >>> this can be resolved by having a convenient kfunc for this that can be
> >>> used for programs doing such redirects. E.g., you could just call
> >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>> would recursively expand into all the kfunc calls needed to extract the
> >>> metadata supported by the SKB path?
> >>>
> >>> -Toke
> >>>
>
Stanislav Fomichev Nov. 1, 2022, 8:12 p.m. UTC | #23
On Tue, Nov 1, 2022 at 10:05 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >>>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>>> custom XDP program that calls the kfuncs and puts the data into the
> >>>> metadata area. We could solve this with some code in libxdp, though; if
> >>>> this code can be made generic enough (so it just dumps the available
> >>>> metadata functions from the running kernel at load time), it may be
> >>>> possible to make it generic enough that it will be forward-compatible
> >>>> with new versions of the kernel that add new fields, which should
> >>>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> Good point. I had to convert to a custom program to use the kfuncs :-(
> >>> But your suggestion sounds good; maybe libxdp can accept some extra
> >>> info about at which offset the user would like to place the metadata
> >>> and the library can generate the required bytecode?
> >>>
> >>>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>>> think the CPUMAP and veth use cases are also quite important, and that
> >>>> we want metadata to be available for building SKBs in this path. Maybe
> >>>> this can be resolved by having a convenient kfunc for this that can be
> >>>> used for programs doing such redirects. E.g., you could just call
> >>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>>> would recursively expand into all the kfunc calls needed to extract the
> >>>> metadata supported by the SKB path?
> >>>
> >>> So this xdp_copy_metadata_for_skb will create a metadata layout that
> >>
> >> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> >> Not sure where is the best point to specify this prog though.  Somehow during
> >> bpf_xdp_redirect_map?
> >> or this prog belongs to the target cpumap and the xdp prog redirecting to this
> >> cpumap has to write the meta layout in a way that the cpumap is expecting?
> >
> > We're probably interested in triggering it from the places where xdp
> > frames can eventually be converted into skbs?
> > So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> > anything that's not XDP_DROP / AF_XDP redirect).
> > We can probably make it magically work, and can generate
> > kernel-digestible metadata whenever data == data_meta, but the
> > question - should we?
> > (need to make sure we won't regress any existing cases that are not
> > relying on the metadata)
>
> Instead of having some kernel-digestible meta data, how about calling another
> bpf prog to initialize the skb fields from the meta area after
> __xdp_build_skb_from_frame() in the cpumap, so
> run_xdp_set_skb_fields_from_metadata() may be a better name.
>
> The xdp_prog@rx sets the meta data and then redirect.  If the xdp_prog@rx can
> also specify a xdp prog to initialize the skb fields from the meta area, then
> there is no need to have a kfunc to enforce a kernel-digestible layout.  Not
> sure what is a good way to specify this xdp_prog though...

Also not sure whether doing it at this point is too late?
Need to take a closer look at all __xdp_build_skb_from_frame() call sites...
See also Toke/Dave discussing potentially having helpers to override
some of that metadata. In this case, having more control on the user
side makes sense.

I'll probably start with an explicit helper for now, just to
see if the overall approach is workable. Maybe we can have a follow up
discussion about doing it more transparently.

> >>> the kernel will be able to understand when converting back to skb?
> >>> IIUC, the xdp program will look something like the following:
> >>>
> >>> if (xdp packet is to be consumed by af_xdp) {
> >>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> >>> own metadata layout
> >>>     return bpf_redirect_map(xsk, ...);
> >>> } else {
> >>>     // if the packet is to be consumed by the kernel
> >>>     xdp_copy_metadata_for_skb(ctx);
> >>>     return bpf_redirect(...);
> >>> }
> >>>
> >>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> >>> put some magic number in the first byte(s) of the metadata so the
> >>> kernel can check whether xdp_copy_metadata_for_skb has been called
> >>> previously (or maybe xdp_frame can carry this extra signal, idk).
>
Martin KaFai Lau Nov. 1, 2022, 9:17 p.m. UTC | #24
On 11/1/22 1:12 PM, Stanislav Fomichev wrote:
>>>>> As for the kfunc-based interface, I think it shows some promise.
>>>>> Exposing a list of function names to retrieve individual metadata items
>>>>> instead of a struct layout is sorta comparable in terms of developer UI
>>>>> accessibility etc (IMO).
>>>>
>>>> Looks like there are quite some use cases for hw_timestamp.
>>>> Do you think we could add it to the uapi like struct xdp_md?
>>>>
>>>> The following is the current xdp_md:
>>>> struct xdp_md {
>>>>            __u32 data;
>>>>            __u32 data_end;
>>>>            __u32 data_meta;
>>>>            /* Below access go through struct xdp_rxq_info */
>>>>            __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>>>            __u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>            __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> };
>>>>
>>>> We could add  __u64 hw_timestamp to the xdp_md so user
>>>> can just do xdp_md->hw_timestamp to get the value.
>>>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>>>> available.
>>>>
>>>> Inside the kernel, the ctx rewriter can generate code
>>>> to call driver specific function to retrieve the data.
>>>
>>> If the driver generates the code to retrieve the data, how's that
>>> different from the kfunc approach?
>>> The only difference I see is that it would be a more strong UAPI than
>>> the kfuncs?
>>
>> Another thing may be worth considering: some hints for some HW/driver may be
>> harder (or may not be worth it) to unroll/inline.  For example, I see a driver
>> doing spin_lock_bh while getting the hwtstamp.  For this case, keeping it as a
>> kfunc call and avoiding the unroll/inline may be the right thing to do.
> 
> Yeah, I'm trying to look at the drivers right now and doing
> spinlocks/seqlocks might complicate the story...
> But it seems like in the worst case, the unrolled bytecode can always
> resort to calling a kernel function?

Unroll the common cases and call a kernel function for everything else? That
should be doable.  The bpf prog calling it as a kfunc will have more
flexibility this way.

> (we might need to have some scratch area to preserve r1-r5 and we
> can't touch r6-r9 because we are not in a real call, but seems doable;
> I'll try to illustrate with a bunch of examples)
> 
> 
>>>> The kfunc approach can be used to *less* common use cases?
>>>
>>> What's the advantage of having two approaches when one can cover
>>> common and uncommon cases?
>>>
>>>>> There are three main drawbacks, AFAICT:
>>>>>
>>>>> 1. It requires driver developers to write and maintain the code that
>>>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>>>> with some internal helpers though (like, e.g., a
>>>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>>>> the required JMP/MOV/LDX instructions?
>>>>>
>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>> metadata supported by the SKB path?
>>>>>
>>>>> -Toke
>>>>>
>>
Jesper Dangaard Brouer Nov. 2, 2022, 2:06 p.m. UTC | #25
On 01/11/2022 18.05, Martin KaFai Lau wrote:
> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau 
>> <martin.lau@linux.dev> wrote:
>>>
>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>> 2. AF_XDP programs won't be able to access the metadata without 
>>>>> using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, 
>>>>> though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>
>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>> info about at which offset the user would like to place the metadata
>>>> and the library can generate the required bytecode?
>>>>
>>>>> 3. It will make it harder to consume the metadata when building 
>>>>> SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract 
>>>>> the
>>>>> metadata supported by the SKB path?
>>>>
>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>
>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>> Not sure where is the best point to specify this prog though.  
>>> Somehow during
>>> bpf_xdp_redirect_map?
>>> or this prog belongs to the target cpumap and the xdp prog 
>>> redirecting to this
>>> cpumap has to write the meta layout in a way that the cpumap is 
>>> expecting?
>>
>> We're probably interested in triggering it from the places where xdp
>> frames can eventually be converted into skbs?
>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>> anything that's not XDP_DROP / AF_XDP redirect).
>> We can probably make it magically work, and can generate
>> kernel-digestible metadata whenever data == data_meta, but the
>> question - should we?
>> (need to make sure we won't regress any existing cases that are not
>> relying on the metadata)
> 
> Instead of having some kernel-digestible meta data, how about calling 
> another bpf prog to initialize the skb fields from the meta area after 
> __xdp_build_skb_from_frame() in the cpumap, so 
> run_xdp_set_skb_fields_from_metadata() may be a better name.
> 

I very much like this idea of calling another bpf prog to initialize the
SKB fields from the meta area. (As a reminder, the data needs to come from
the meta area, because at this point the hardware RX-desc is out-of-scope.)
I'm on board with xdp_copy_metadata_for_skb() populating the meta area.

We could invoke this BPF-prog inside __xdp_build_skb_from_frame().

We might need a new BPF_PROG_TYPE_XDP2SKB, as this new BPF-prog
run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
context inputs. Right?  (Not sure if this is acceptable with the BPF
maintainers' new rules.)
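
Something like this, as a purely hypothetical sketch of that context:

struct xdp2skb_md {
	struct xdp_md *xdp;	/* meta area carrying the hints */
	struct __sk_buff *skb;	/* skb under construction, fields to fill */
};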

> The xdp_prog@rx sets the meta data and then redirect.  If the 
> xdp_prog@rx can also specify a xdp prog to initialize the skb fields 
> from the meta area, then there is no need to have a kfunc to enforce a 
> kernel-digestible layout.  Not sure what is a good way to specify this 
> xdp_prog though...

The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
__xdp_build_skb_from_frame() is that it needs to know how to decode the
meta area for every device driver or XDP-prog populating it (as veth
and cpumap can get redirected packets from multiple device drivers).
Sure, using a common function/helper/macro like
xdp_copy_metadata_for_skb() could help reduce this multiplexing, but we
want to have maximum flexibility to extend this without having to update
the kernel, right.

Fortunately __xdp_build_skb_from_frame() has a net_device parameter
that points to the device the frame was received on (xdp_frame->dev_rx).
Thus, we could extend net_device and add this BPF-prog on a per
net_device basis.  This could function as a driver BPF-prog callback
that populates the SKB fields from the XDP metadata.
Is this a good or bad idea?


>>>> the kernel will be able to understand when converting back to skb?
>>>> IIUC, the xdp program will look something like the following:
>>>>
>>>> if (xdp packet is to be consumed by af_xdp) {
>>>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble 
>>>> your
>>>> own metadata layout
>>>>     return bpf_redirect_map(xsk, ...);
>>>> } else {
>>>>     // if the packet is to be consumed by the kernel
>>>>     xdp_copy_metadata_for_skb(ctx);
>>>>     return bpf_redirect(...);
>>>> }
>>>>
>>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
>>>> put some magic number in the first byte(s) of the metadata so the
>>>> kernel can check whether xdp_copy_metadata_for_skb has been called
>>>> previously (or maybe xdp_frame can carry this extra signal, idk).

I'm in favor of adding a flag bit to xdp_frame to signal this.
In __xdp_build_skb_from_frame() we could check this flag and then
invoke the BPF_PROG_TYPE_XDP2SKB program.

--Jesper
Toke Høiland-Jørgensen Nov. 2, 2022, 10:01 p.m. UTC | #26
Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On 01/11/2022 18.05, Martin KaFai Lau wrote:
>> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau 
>>> <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>> 2. AF_XDP programs won't be able to access the metadata without 
>>>>>> using a
>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>> metadata area. We could solve this with some code in libxdp, 
>>>>>> though; if
>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>> with new versions of the kernel that add new fields, which should
>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>> info about at which offset the user would like to place the metadata
>>>>> and the library can generate the required bytecode?
>>>>>
>>>>>> 3. It will make it harder to consume the metadata when building 
>>>>>> SKBs. I
>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>> would recursively expand into all the kfunc calls needed to extract 
>>>>>> the
>>>>>> metadata supported by the SKB path?
>>>>>
>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>
>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>> Not sure where is the best point to specify this prog though.  
>>>> Somehow during
>>>> bpf_xdp_redirect_map?
>>>> or this prog belongs to the target cpumap and the xdp prog 
>>>> redirecting to this
>>>> cpumap has to write the meta layout in a way that the cpumap is 
>>>> expecting?
>>>
>>> We're probably interested in triggering it from the places where xdp
>>> frames can eventually be converted into skbs?
>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>> anything that's not XDP_DROP / AF_XDP redirect).
>>> We can probably make it magically work, and can generate
>>> kernel-digestible metadata whenever data == data_meta, but the
>>> question - should we?
>>> (need to make sure we won't regress any existing cases that are not
>>> relying on the metadata)
>> 
>> Instead of having some kernel-digestible meta data, how about calling 
>> another bpf prog to initialize the skb fields from the meta area after 
>> __xdp_build_skb_from_frame() in the cpumap, so 
>> run_xdp_set_skb_fields_from_metadata() may be a better name.
>> 
>
> I very much like this idea of calling another bpf prog to initialize the
> SKB fields from the meta area. (As a reminder, data need to come from
> meta area, because at this point the hardware RX-desc is out-of-scope).
> I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>
> We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>
> We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
> run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
> context inputs. Right?  (Not sure, if this is acceptable with the BPF
> maintainers new rules)
>
>> The xdp_prog@rx sets the meta data and then redirect.  If the 
>> xdp_prog@rx can also specify a xdp prog to initialize the skb fields 
>> from the meta area, then there is no need to have a kfunc to enforce a 
>> kernel-digestible layout.  Not sure what is a good way to specify this 
>> xdp_prog though...
>
> The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
> __xdp_build_skb_from_frame() is that it needs to know how to decode the
> meta area for every device driver or XDP-prog populating it (as veth
> and cpumap can get redirected packets from multiple device drivers).

If we have the helper to copy the data "out of" the drivers, why do we
need a second BPF program to copy data to the SKB?

I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
each of the kfuncs needed for the metadata used by SKBs, all of which
get unrolled. The helper takes the output of these metadata-extracting
kfuncs and stores it "somewhere". This "somewhere" could well be the
metadata area; but in any case, since it's hidden away inside a helper
(or kfunc) from the calling XDP program's PoV, the helper can just stash
all the data in a fixed format, which __xdp_build_skb_from_frame() can
then just read statically. We could even make this format match the
field layout of struct sk_buff, so all we have to do is memcpy a
contiguous chunk of memory when building the SKB.

> Sure, using a common function/helper/macro like
> xdp_copy_metadata_for_skb() could help reduce this multiplexing, but
> we want to have maximum flexibility to extend this without having to
> update the kernel, right.

The extension mechanism is in which kfuncs are available to XDP programs
to extract metadata. The kernel then just becomes another consumer of
those kfuncs, by way of the xdp_copy_metadata_for_skb(); but there could
also be other kfuncs added that are not used for skbs (even
vendor-specific ones if we want to allow that).

-Toke
Stanislav Fomichev Nov. 2, 2022, 11:10 p.m. UTC | #27
On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>
> > On 01/11/2022 18.05, Martin KaFai Lau wrote:
> >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
> >>> <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >>>>>> 2. AF_XDP programs won't be able to access the metadata without
> >>>>>> using a
> >>>>>> custom XDP program that calls the kfuncs and puts the data into the
> >>>>>> metadata area. We could solve this with some code in libxdp,
> >>>>>> though; if
> >>>>>> this code can be made generic enough (so it just dumps the available
> >>>>>> metadata functions from the running kernel at load time), it may be
> >>>>>> possible to make it generic enough that it will be forward-compatible
> >>>>>> with new versions of the kernel that add new fields, which should
> >>>>>> alleviate Florian's concern about keeping things in sync.
> >>>>>
> >>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
> >>>>> But your suggestion sounds good; maybe libxdp can accept some extra
> >>>>> info about at which offset the user would like to place the metadata
> >>>>> and the library can generate the required bytecode?
> >>>>>
> >>>>>> 3. It will make it harder to consume the metadata when building
> >>>>>> SKBs. I
> >>>>>> think the CPUMAP and veth use cases are also quite important, and that
> >>>>>> we want metadata to be available for building SKBs in this path. Maybe
> >>>>>> this can be resolved by having a convenient kfunc for this that can be
> >>>>>> used for programs doing such redirects. E.g., you could just call
> >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>>>>> would recursively expand into all the kfunc calls needed to extract
> >>>>>> the
> >>>>>> metadata supported by the SKB path?
> >>>>>
> >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
> >>>>
> >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> >>>> Not sure where is the best point to specify this prog though.
> >>>> Somehow during
> >>>> bpf_xdp_redirect_map?
> >>>> or this prog belongs to the target cpumap and the xdp prog
> >>>> redirecting to this
> >>>> cpumap has to write the meta layout in a way that the cpumap is
> >>>> expecting?
> >>>
> >>> We're probably interested in triggering it from the places where xdp
> >>> frames can eventually be converted into skbs?
> >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> >>> anything that's not XDP_DROP / AF_XDP redirect).
> >>> We can probably make it magically work, and can generate
> >>> kernel-digestible metadata whenever data == data_meta, but the
> >>> question - should we?
> >>> (need to make sure we won't regress any existing cases that are not
> >>> relying on the metadata)
> >>
> >> Instead of having some kernel-digestible meta data, how about calling
> >> another bpf prog to initialize the skb fields from the meta area after
> >> __xdp_build_skb_from_frame() in the cpumap, so
> >> run_xdp_set_skb_fields_from_metadata() may be a better name.
> >>
> >
> > I very much like this idea of calling another bpf prog to initialize the
> > SKB fields from the meta area. (As a reminder, data need to come from
> > meta area, because at this point the hardware RX-desc is out-of-scope).
> > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
> >
> > We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
> >
> > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
> > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
> > context inputs. Right?  (Not sure, if this is acceptable with the BPF
> > maintainers new rules)
> >
> >> The xdp_prog@rx sets the meta data and then redirect.  If the
> >> xdp_prog@rx can also specify a xdp prog to initialize the skb fields
> >> from the meta area, then there is no need to have a kfunc to enforce a
> >> kernel-digestible layout.  Not sure what is a good way to specify this
> >> xdp_prog though...
> >
> > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
> > __xdp_build_skb_from_frame() is that it needs to know how to decode the
> > meta area for every device driver or XDP-prog populating it (as veth
> > and cpumap can get redirected packets from multiple device drivers).
>
> If we have the helper to copy the data "out of" the drivers, why do we
> need a second BPF program to copy data to the SKB?
>
> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
> each of the kfuncs needed for the metadata used by SKBs, all of which
> get unrolled. The helper takes the output of these metadata-extracting
> kfuncs and stores it "somewhere". This "somewhere" could well be the
> metadata area; but in any case, since it's hidden away inside a helper
> (or kfunc) from the calling XDP program's PoV, the helper can just stash
> all the data in a fixed format, which __xdp_build_skb_from_frame() can
> then just read statically. We could even make this format match the
> field layout of struct sk_buff, so all we have to do is memcpy a
> contiguous chunk of memory when building the SKB.

+1

I'm currently doing exactly what you're suggesting (minus matching skb layout):

struct xdp_to_skb_metadata {
  u32 magic; // randomized at boot
  ... skb-consumable-metadata in fixed format
} __randomize_layout;

bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
-sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
to fill in the actual data.

Then, at __xdp_build_skb_from_frame time, regular kernel C code parses
that 'struct xdp_to_skb_metadata'.
(To be precise, I'm trying to parse the metadata from
skb_metadata_set; it's called from __xdp_build_skb_from_frame, but I'm
not 100% sure that's the right place.)
(I also randomize the layout and the magic to make sure userspace
doesn't depend on it, because nothing stops this packet from being
routed into an xsk socket.)
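
For concreteness, the XDP-prog side could look roughly like this (the
kfunc name and return convention are from this discussion, not a merged
kernel API):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfunc from this discussion, not a merged kernel API. */
extern int bpf_xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
	/* Reserve sizeof(struct xdp_to_skb_metadata) in the meta area and
	 * fill it via the (unrolled) per-field kfuncs. */
	if (bpf_xdp_copy_metadata_for_skb(ctx))
		return XDP_ABORTED;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";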

> [...]
Toke Høiland-Jørgensen Nov. 3, 2022, 12:09 a.m. UTC | #28
Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> [...]
>>
>> If we have the helper to copy the data "out of" the drivers, why do we
>> need a second BPF program to copy data to the SKB?
>>
>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>> each of the kfuncs needed for the metadata used by SKBs, all of which
>> get unrolled. The helper takes the output of these metadata-extracting
>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>> metadata area; but in any case, since it's hidden away inside a helper
>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>> then just read statically. We could even make this format match the
>> field layout of struct sk_buff, so all we have to do is memcpy a
>> contiguous chunk of memory when building the SKB.
>
> +1
>
> I'm currently doing exactly what you're suggesting (minus matching skb layout):
>
> struct xdp_to_skb_metadata {
>   u32 magic; // randomized at boot
>   ... skb-consumable-metadata in fixed format
> } __randomize_layout;
>
> bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
> -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
> to fill in the actual data.
>
> Then, at __xdp_build_skb_from_frame time, regular kernel C code parses
> that 'struct xdp_to_skb_metadata'.
> (To be precise, I'm trying to parse the metadata from
> skb_metadata_set; it's called from __xdp_build_skb_from_frame, but I'm
> not 100% sure that's the right place.)
> (I also randomize the layout and the magic to make sure userspace
> doesn't depend on it, because nothing stops this packet from being
> routed into an xsk socket.)

Ah, nice trick with __randomize_layout - I agree we need to do something
to prevent userspace from inadvertently starting to rely on this, and
this seems like a great solution!

Look forward to seeing what the whole thing looks like in a more
complete form :)

-Toke
Jesper Dangaard Brouer Nov. 3, 2022, 12:01 p.m. UTC | #29
On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> [...]
>>>
>>> If we have the helper to copy the data "out of" the drivers, why do we
>>> need a second BPF program to copy data to the SKB?
>>>

IMHO the second BPF program to populate the SKB is needed to add
flexibility and extensibility.

My end-goal here is to speed up packet parsing.
This BPF-prog should (in time) be able to update skb->transport_header
and skb->network_header.  As I mentioned before, the HW RX-hash already
tells us the L3 and L4 protocols and, in most cases, the header length.
Even without HW-hints, the XDP-prog has likely parsed the packet once.
This parse information is lost today and redone by the netstack. What
about storing this header parse info in the meta area and re-using it
in this new XDP2SKB hook?
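
A rough sketch of that idea (the 'parse_hints' layout is purely
illustrative, not an existing kernel struct):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative layout for passing parse results to an XDP2SKB hook. */
struct parse_hints {
	__u16 network_off;   /* L3 header offset from ctx->data */
	__u16 transport_off; /* L4 header offset from ctx->data */
};

SEC("xdp")
int store_parse_info(struct xdp_md *ctx)
{
	struct parse_hints *hints;
	void *data;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*hints)))
		return XDP_PASS;

	data = (void *)(long)ctx->data;
	hints = (void *)(long)ctx->data_meta;
	if ((void *)(hints + 1) > data)
		return XDP_PASS;

	/* A real program would parse the headers here; the values below
	 * are placeholders for Ethernet + IPv4 without options. */
	hints->network_off = 14;
	hints->transport_off = 34;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";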

The reason for suggesting that this BPF-prog be a callback, associated
with the net_device, was that hardware differs in which HW hints it
supports.  Thus, we can avoid a generic C-function that needs to check
for all the possible hints, and instead have a BPF-prog that only
contains the code that is relevant for this net_device.


>>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>>> each of the kfuncs needed for the metadata used by SKBs, all of which
>>> get unrolled. The helper takes the output of these metadata-extracting
>>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>>> metadata area; but in any case, since it's hidden away inside a helper
>>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>>> then just read statically. We could even make this format match the
>>> field layout of struct sk_buff, so all we have to do is memcpy a
>>> contiguous chunk of memory when building the SKB.
>>
>> +1

Sorry, I think this "hiding" layout trick is going in the wrong direction.

Imagine the use-case of cpumap redirect.  The physical device XDP-prog
calls xdp_copy_metadata_for_skb() to extract info from the RX-desc, then
it calls redirect into the cpumap.  On the remote CPU, the xdp_frame is
picked up, and then I want to run another XDP-prog that wants to look at
these HW-hints, and then likely calls XDP_PASS, which creates the SKB,
also using these HW-hints.  I take it that would not be possible when
using the xdp_copy_metadata_for_skb() helper?
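
To make the scenario concrete, the RX-side prog would be something like
this (bpf_xdp_copy_metadata_for_skb() is the helper under discussion,
not a merged API):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfunc from this discussion, not a merged kernel API. */
extern int bpf_xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int redirect_to_cpu(struct xdp_md *ctx)
{
	/* Extract HW hints while the RX descriptor is still in scope... */
	bpf_xdp_copy_metadata_for_skb(ctx);
	/* ...then ship the frame to another CPU; the hints travel along
	 * in the metadata area. */
	return bpf_redirect_map(&cpu_map, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";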

>>
>> [...]
> 
> Ah, nice trick with __randomize_layout - I agree we need to do something
> to prevent userspace from inadvertently starting to rely on this, and
> this seems like a great solution!

Sorry, I disagree with where this is going.  Why do we all of a sudden
want to prevent userspace (e.g. AF_XDP) from using this data?!?

The whole exercise started with wanting to provide AF_XDP with these
HW-hints. The hints a standard AF_XDP user wants are likely very
similar to what the SKB user wants. Why does the AF_XDP user need to
open-code this?

The BTF-ID scheme precisely allows us to expose this layout to
userspace, and at the same time have the freedom to change it in kernel
space, as userspace must decode the BTF layout before reading it.
I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID
scheme, with the BTF-ID of struct xdp_hints_rx_common; is that too much
to ask for?  You can just consider the BTF-ID the magic number, as it
will be more-or-less random per kernel build (and module load order).

> Look forward to seeing what the whole thing looks like in a more
> complete form :)

I'm sort of on-board with the kfuncs and unroll-tricks, if I can see
some driver code that handles the issue of exposing the HW setup state
needed to decode the RX-desc format.

I sense that I myself haven't been good enough at explaining/conveying
the BTF-ID scheme.  Next week, I will code some examples that demo how
BTF-IDs can be used from BPF-progs, even as a communication channel
between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog ->
TC-BPF).

--Jesper
Toke Høiland-Jørgensen Nov. 3, 2022, 12:48 p.m. UTC | #30
Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> [...]
>>>>
>>>> If we have the helper to copy the data "out of" the drivers, why do we
>>>> need a second BPF program to copy data to the SKB?
>>>>
>
> IMHO the second BPF program to populate the SKB is needed to add
> flexibility and extensibility.
>
> My end-goal here is to speed up packet parsing.
> This BPF-prog should (in time) be able to update skb->transport_header
> and skb->network_header.  As I mentioned before, the HW RX-hash already
> tells us the L3 and L4 protocols and, in most cases, the header length.
> Even without HW-hints, the XDP-prog has likely parsed the packet once.
> This parse information is lost today and redone by the netstack. What
> about storing this header parse info in the meta area and re-using it
> in this new XDP2SKB hook?
>
> The reason for suggesting that this BPF-prog be a callback, associated
> with the net_device, was that hardware differs in which HW hints it
> supports.  Thus, we can avoid a generic C-function that needs to check
> for all the possible hints, and instead have a BPF-prog that only
> contains the code that is relevant for this net_device.

But that's exactly what the xdp_copy_metadata_for_skb() is! It's a
dynamic "BPF program" (generated as unrolled kfunc calls) just running
in the helper and stashing the results in an intermediate struct in the
metadata area. And once it's done that, we don't need *another* dynamic
BPF program to read it back out and populate the SKB, because the
intermediate format it's been stashed into is under the control of the
kernel (we just need a flag to indicate that it's there).
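
A sketch of that consuming side (the struct fields and the magic check
are illustrative here, not merged code):

#include <linux/skbuff.h>
#include <net/xdp.h>

static void xdp_populate_skb_hints(struct sk_buff *skb,
				   const struct xdp_buff *xdp)
{
	const struct xdp_to_skb_metadata *meta = xdp->data_meta;

	if (xdp->data - xdp->data_meta < sizeof(*meta))
		return;
	if (meta->magic != xdp_to_skb_metadata_magic) /* e.g. the BTF-ID */
		return;

	/* The layout is fixed (if randomized per build), so the kernel
	 * reads it statically; no per-driver decoding is needed. */
	skb_set_hash(skb, meta->rx_hash, meta->rx_hash_type);
	skb_hwtstamps(skb)->hwtstamp = ns_to_ktime(meta->rx_timestamp);
}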

>>>> [...]
>>>
>>> +1
>
> Sorry, I think this "hiding" layout trick is going in the wrong direction.
>
> Imagine the use-case of cpumap redirect.  The physical device XDP-prog
> calls xdp_copy_metadata_for_skb() to extract info from the RX-desc, then
> it calls redirect into the cpumap.  On the remote CPU, the xdp_frame is
> picked up, and then I want to run another XDP-prog that wants to look at
> these HW-hints, and then likely calls XDP_PASS, which creates the SKB,
> also using these HW-hints.  I take it that would not be possible when
> using the xdp_copy_metadata_for_skb() helper?

You're right that it should be possible to read the values back out
again later. That is totally possible with this scheme, though; the
'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP
program can just read that. We can explicitly support it by using the
BTF ID as the "magic value" as you suggest, which would be fine by me.

I still think we should be using the __randomize_layout trick, though,
precisely so that BPF consumers are forced to use BTF relocations to
read it; otherwise we risk the struct layout ossifying into UAPI because
people are just going to assume it's static...
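
On the consumer side that would be something like the below (the struct
mirror and its fields are assumptions; CO-RE fixes up the field offsets
against the running kernel's BTF at load time):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Local mirror of the discussed struct; with __randomize_layout on the
 * kernel side, preserve_access_index makes the compiler emit CO-RE
 * relocations instead of hard-coded offsets. */
struct xdp_to_skb_metadata {
	__u32 magic;
	__u64 rx_timestamp;
	__u32 rx_hash;
} __attribute__((preserve_access_index));

__u64 last_rx_timestamp;

SEC("xdp")
int read_skb_hints(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	struct xdp_to_skb_metadata *meta = (void *)(long)ctx->data_meta;

	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	/* The kernel's BTF ID for the struct doubles as the magic value. */
	if (meta->magic != bpf_core_type_id_kernel(struct xdp_to_skb_metadata))
		return XDP_PASS;

	last_rx_timestamp = meta->rx_timestamp;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";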

>>> [...]
>> 
>> Ah, nice trick with __randomize_layout - I agree we need to do something
>> to prevent userspace from inadvertently starting to rely on this, and
>> this seems like a great solution!
>
> Sorry, I disagree with where this is going.  Why do we all of a sudden
> want to prevent userspace (e.g. AF_XDP) from using this data?!?

See above: I don't think we should prevent userspace from using it (and
we're not), but we should prevent the struct layout from ossifying.

> The whole exercise started with wanting to provide AF_XDP with these
> HW-hints. The hints a standard AF_XDP user wants are likely very
> similar to what the SKB user wants. Why does the AF_XDP user need to
> open-code this?
>
> The BTF-ID scheme precisely allows us to expose this layout to
> userspace, and at the same time have the freedom to change it in kernel
> space, as userspace must decode the BTF layout before reading it.
> I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID
> scheme, with the BTF-ID of struct xdp_hints_rx_common; is that too much
> to ask for?  You can just consider the BTF-ID the magic number, as it
> will be more-or-less random per kernel build (and module load order).

As mentioned above, I would be totally fine with just having the
xdp_to_skb_metadata be part of BTF, enabling both XDP programs and
AF_XDP consumers to re-use it.

>> Look forward to seeing what the whole thing looks like in a more
>> complete form :)
>
> I'm sort of on-board with the kfuncs and unroll-tricks, if I can see
> some driver code that handles the issue of exposing the HW setup state
> needed to decode the RX-desc format.
>
> I sense that I myself haven't been good enough at explaining/conveying
> the BTF-ID scheme.  Next week, I will code some examples that demo how
> BTF-IDs can be used from BPF-progs, even as a communication channel
> between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog ->
> TC-BPF).

For my part at least, it's not a lack of understanding that makes me
prefer the kfunc approach. Rather, it's the complexity of having to
resolve the multiple BTF IDs, and the risk of ossifying the struct
layouts because people are going to do that wrong. Using kfuncs gives us
much more control of the API, especially if we combine it with struct
randomisation for the bits we do expose.

Translating what we've discussed above into the terms used in your patch
series, this would correspond to *only* having the xdp_metadata_common
struct exposed via BTF, and not bothering with all the other
driver-specific layouts. So an XDP/AF_XDP user that only wants to use
the metadata that's also used by the stack can just call
xdp_copy_metadata_for_skb(), and then read the resulting metadata area
(using BTF). And if someone wants to access metadata that's *not* used
by the stack, they'd have to call additional kfuncs to extract that.

And similarly, if someone wants only a subset of the metadata used by an
SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead
just call the individual kfuncs to extract just the fields they want.
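
e.g. something along these lines (the per-field kfunc name is made up
here for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical per-field kfunc in the spirit of this series. */
extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

SEC("xdp")
int want_timestamp_only(struct xdp_md *ctx)
{
	/* No xdp_copy_metadata_for_skb(): extract just one field. */
	__u64 ts = bpf_xdp_metadata_rx_timestamp(ctx);

	if (!ts)
		return XDP_DROP; /* e.g. drop frames without a HW timestamp */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";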

I think this strikes a nice balance between the flexibility by the
kernel to change things, the flexibility of XDP consumers to request
only the data they want, and the ability for the same metadata to be
consumed at different points. WDYT?

-Toke
Jesper Dangaard Brouer Nov. 3, 2022, 3:25 p.m. UTC | #31
On 03/11/2022 13.48, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
> 
>> On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
>>> Stanislav Fomichev <sdf@google.com> writes:
>>>
>>>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>> If we have the helper to copy the data "out of" the drivers, why do we
>>>>> need a second BPF program to copy data to the SKB?
>>>>>
>>
>> [...]
> 
> But that's exactly what the xdp_copy_metadata_for_skb() is! It's a
> dynamic "BPF program" (generated as unrolled kfunc calls) just running
> in the helper and stashing the results in an intermediate struct in the
> metadata area. And once it's done that, we don't need *another* dynamic
> BPF program to read it back out and populate the SKB, because the
> intermediate format it's been stashed into is under the control of the
> kernel (we just need a flag to indicate that it's there).
> 
>> [...]
> 
> You're right that it should be possible to read the values back out
> again later. That is totally possible with this scheme, though; the
> 'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP
> program can just read that. We can explicitly support it by using the
> BTF ID as the "magic value" as you suggest, which would be fine by me.
> 

I'm on-board if, as you suggest, we add the BTF-ID as the "magic value"
(as the last member, due to AF_XDP processing) when
xdp_copy_metadata_for_skb() writes 'xdp_to_skb_metadata' into the
metadata area.  We should simply see this BTF-ID as a 'cookie' or magic
number.

> I still think we should be using the __randomize_layout trick, though,
> precisely so that BPF consumers are forced to use BTF relocations to
> read it; otherwise we risk the struct layout ossifying into UAPI because
> people are just going to assume it's static...
> 

I'm also on-board with some level of randomization of the struct to
force consumers to use BTF for relocations; e.g., the BTF-ID
cookie/magic should be at a fixed location.


>>>> [...]
>>>
>>> Ah, nice trick with __randomize_layout - I agree we need to do something
>>> to prevent userspace from inadvertently starting to rely on this, and
>>> this seems like a great solution!
>>
>> Sorry, I disagree with where this is going.  Why do we all of a sudden
>> want to prevent userspace (e.g. AF_XDP) from using this data?!?
> 
> See above: I don't think we should prevent userspace from using it (and
> we're not), but we should prevent the struct layout from ossifying.
> 

Okay, then we are in agreement, avoid ossifying struct layout.

>> [...]
> 
> As mentioned above, I would be totally fine with just having the
> xdp_to_skb_metadata be part of BTF, enabling both XDP programs and
> AF_XDP consumers to re-use it.
>

My use-case is that AF_XDP will need to key on this runtime BTF-ID magic
value anyway to read out 'xdp_to_skb_metadata' values.  I have an
XDP-prog running that RX-timestamps only the time-sync protocol
packets.  Code-wise, I will simply add my RX-timestamp on top of struct
'xdp_to_skb_metadata' and then update the BTF-ID magic value.  I don't
need to add a real BTF-ID, just some magic value that my AF_XDP
userspace prog knows about.  In my current code[1], I'm playing nice and
add the BPF-prog's own local BTF-ID via bpf_core_type_id_local().

[1] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
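
Transplanted onto the scheme discussed here, that would look roughly
like this (the xdp_to_skb_metadata mirror and its fields are
assumptions):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Minimal mirror of the discussed kernel struct (normally taken from
 * vmlinux.h); the fields are illustrative. */
struct xdp_to_skb_metadata {
	__u64 rx_timestamp;
	__u32 rx_hash;
	__u32 magic; /* BTF-ID "cookie", kept at a fixed (last) position */
};

/* The prog's own layout: private fields stacked on top, the
 * skb-consumable part last. */
struct my_metadata {
	__u64 time_sync_rx_timestamp;
	struct xdp_to_skb_metadata base;
};

SEC("xdp")
int timestamp_time_sync(struct xdp_md *ctx)
{
	struct my_metadata *meta;
	void *data;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;
	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	/* Stand-in for a HW stamp extracted via a kfunc. */
	meta->time_sync_rx_timestamp = bpf_ktime_get_ns();
	/* Advertise the extended layout via the prog's own local BTF-ID. */
	meta->base.magic = bpf_core_type_id_local(struct my_metadata);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";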


>>> Look forward to seeing what the whole thing looks like in a more
>>> complete form :)
>>
>> [...]
> 
> For my part at least, it's not a lack of understanding that makes me
> prefer the kfunc approach. Rather, it's the complexity of having to
> resolve the multiple BTF IDs, and the risk of ossifying the struct
> layouts because people are going to do that wrong. Using kfuncs gives us
> much more control of the API, especially if we combine it with struct
> randomisation for the bits we do expose.
> 
> Translating what we've discussed above into the terms used in your patch
> series, this would correspond to *only* having the xdp_metadata_common
> struct exposed via BTF, and not bothering with all the other
> driver-specific layouts. So an XDP/AF_XDP user that only wants to use
> the metadata that's also used by the stack can just call
> xdp_copy_metadata_for_skb(), and then read the resulting metadata area
> (using BTF). And if someone wants to access metadata that's *not* used
> by the stack, they'd have to call additional kfuncs to extract that.
> 

I agree that this patchset does/will simplify my patchset.  My
driver-specific structs for BTF-IDs will no longer be needed, as it is
now up to the XDP-prog to explicitly extend the metadata with fields.
This should reduce your concern about resolving multiple BTF-IDs.

Maybe after this patchset, we could create some "kernel-central" structs
that have e.g. the RX-timestamp, the mark (mlx5 has HW support for mark)
and the protocol types (via RX-hash).  These could be used by XDP-progs
that explicitly extract these fields and communicate the layout by
setting the BTF-ID (via bpf_core_type_id_kernel()), making it easier to
consume from chained BPF-progs, AF_XDP and even kernel C-code that
updates SKB fields, as the number of these magic BTF-IDs will be small
enough.


> And similarly, if someone wants only a subset of the metadata used by an
> SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead
> just call the individual kfuncs to extract just the fields they want.
> 
> I think this strikes a nice balance between the flexibility by the
> kernel to change things, the flexibility of XDP consumers to request
> only the data they want, and the ability for the same metadata to be
> consumed at different points. WDYT?

With the above comments, I think we are closer to an agreement again.

--Jesper