
[RFCv2,bpf-next,00/18] XDP-hints: XDP gaining access to HW offload hints via BTF

Message ID 166256538687.1434226.15760041133601409770.stgit@firesoul (mailing list archive)
Series XDP-hints: XDP gaining access to HW offload hints via BTF

Message

Jesper Dangaard Brouer Sept. 7, 2022, 3:45 p.m. UTC
This patchset exposes the traditional hardware offload hints to XDP and
relies on BTF to expose the layout to users.

The main idea is that the kernel and NIC drivers simply define the struct
layouts they choose to use for XDP-hints. These XDP-hints structs get
naturally and automatically described via BTF and are implicitly exported
to users. NIC drivers populate and record their own BTF ID as the last
member in the XDP metadata area (making it easily accessible to AF_XDP
userspace at a known negative offset from packet data start).
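
To make that layout concrete, here is a minimal sketch (struct and member
names are illustrative only, not the patchset's exact definitions):

   /* Driver-chosen hints layout; the 64-bit BTF ID is the last member,
    * so it sits at a fixed negative offset from packet data start.
    */
   struct xdp_hints_example {
           __u32 rx_hash;
           __u16 vlan_tci;
           __u16 flags;
           __u64 btf_id;   /* combined object ID + type ID */
   };

   /* AF_XDP userspace can locate the BTF ID without knowing anything
    * else about the layout:
    */
   __u64 btf_id = *(__u64 *)((char *)pkt_data - sizeof(__u64));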

A naming convention for the structs (xdp_hints_*) is used such that
userspace can find and decode the BTF layout and match it against the
provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
what XDP-hints a driver supports.

The patch "i40e: Add xdp_hints_union" introduce the idea of creating a
union named "xdp_hints_union" in every driver, which contains all
xdp_hints_* struct this driver can support. This makes it easier/quicker
to find and parse the relevant BTF types.  (Seeking input before fixing
up all drivers in patchset).
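
A hypothetical sketch of such a union for a driver "foo" (the real one for
i40e is defined in that patch):

   union xdp_hints_union {
           struct xdp_hints_common        common;
           struct xdp_hints_foo           base;
           struct xdp_hints_foo_timestamp timestamp;
   };

Tools then only need to look up one well-known type name per driver module
and walk its members to find every layout the driver may emit.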


The main differences from RFC-v1:
 - Drop the idea of a BTF "origin" (vmlinux, module or local)
 - Instead use a full 64-bit BTF ID that combines object+type ID

I've taken some of Alexandr/Larysa's libbpf patches and integrated
those.

The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
required some refactoring to remove the SKB dependencies.


---

Jesper Dangaard Brouer (10):
      net: create xdp_hints_common and set functions
      net: add net_device feature flag for XDP-hints
      xdp: controlling XDP-hints from BPF-prog via helper
      i40e: Refactor i40e_ptp_rx_hwtstamp
      i40e: refactor i40e_rx_checksum with helper
      bpf: export btf functions for modules
      btf: Add helper for kernel modules to lookup full BTF ID
      i40e: add XDP-hints handling
      net: use XDP-hints in xdp_frame to SKB conversion
      i40e: Add xdp_hints_union

Larysa Zaremba (3):
      libbpf: factor out BTF loading from load_module_btfs()
      libbpf: try to load vmlinux BTF from the kernel first
      libbpf: patch module BTF obj+type ID into BPF insns

Lorenzo Bianconi (1):
      mvneta: add XDP-hints support

Maryam Tahhan (4):
      ixgbe: enable xdp-hints
      ixgbe: add rx timestamp xdp hints support
      xsk: AF_XDP xdp-hints support in desc options
      ixgbe: AF_XDP xdp-hints processing in ixgbe_clean_rx_irq_zc


 drivers/net/ethernet/intel/i40e/i40e.h        |   1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  22 ++
 drivers/net/ethernet/intel/i40e/i40e_ptp.c    |  36 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 252 ++++++++++++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   5 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 217 +++++++++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c  |  82 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |   2 +
 drivers/net/ethernet/marvell/mvneta.c         |  59 +++-
 include/linux/btf.h                           |   3 +
 include/linux/netdev_features.h               |   3 +-
 include/net/xdp.h                             | 256 +++++++++++++++++-
 include/uapi/linux/bpf.h                      |  35 +++
 include/uapi/linux/if_xdp.h                   |   2 +-
 kernel/bpf/btf.c                              |  36 ++-
 net/core/filter.c                             |  52 ++++
 net/core/xdp.c                                |  22 +-
 net/ethtool/common.c                          |   1 +
 net/xdp/xsk.c                                 |   2 +-
 net/xdp/xsk_queue.h                           |   3 +-
 tools/lib/bpf/bpf_core_read.h                 |   3 +-
 tools/lib/bpf/btf.c                           | 142 +++++++++-
 tools/lib/bpf/libbpf.c                        |  52 +---
 tools/lib/bpf/libbpf_internal.h               |   7 +-
 tools/lib/bpf/relo_core.c                     |   8 +-
 tools/lib/bpf/relo_core.h                     |   1 +
 26 files changed, 1127 insertions(+), 177 deletions(-)

--

Comments

Alexander Lobakin Sept. 8, 2022, 9:30 a.m. UTC | #1
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 07 Sep 2022 17:45:00 +0200

> This patchset exposes the traditional hardware offload hints to XDP and
> relies on BTF to expose the layout to users.
> 
> The main idea is that the kernel and NIC drivers simply define the struct
> layouts they choose to use for XDP-hints. These XDP-hints structs get
> naturally and automatically described via BTF and are implicitly exported
> to users. NIC drivers populate and record their own BTF ID as the last
> member in the XDP metadata area (making it easily accessible to AF_XDP
> userspace at a known negative offset from packet data start).
> 
> A naming convention for the structs (xdp_hints_*) is used such that
> userspace can find and decode the BTF layout and match it against the
> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
> what XDP-hints a driver supports.
> 
> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
> union named "xdp_hints_union" in every driver, which contains all the
> xdp_hints_* structs this driver can support. This makes it easier/quicker
> to find and parse the relevant BTF types.  (Seeking input before fixing
> up all drivers in the patchset.)
> 
> 
> The main differences from RFC-v1:
>  - Drop the idea of a BTF "origin" (vmlinux, module or local)
>  - Instead use a full 64-bit BTF ID that combines object+type ID
> 
> I've taken some of Alexandr/Larysa's libbpf patches and integrated
> those.

Not sure if it's okay to inform the authors about this only
after sending? Esp. from the, eeeh... "incompatible" implementation?
I realize it's open code, but this comes across as sorta deprecating.

> 
> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
> required some refactoring to remove the SKB dependencies.
> 
> 
> ---
> 
> Jesper Dangaard Brouer (10):
>       net: create xdp_hints_common and set functions
>       net: add net_device feature flag for XDP-hints
>       xdp: controlling XDP-hints from BPF-prog via helper
>       i40e: Refactor i40e_ptp_rx_hwtstamp
>       i40e: refactor i40e_rx_checksum with helper
>       bpf: export btf functions for modules
>       btf: Add helper for kernel modules to lookup full BTF ID
>       i40e: add XDP-hints handling
>       net: use XDP-hints in xdp_frame to SKB conversion
>       i40e: Add xdp_hints_union
> 
> Larysa Zaremba (3):
>       libbpf: factor out BTF loading from load_module_btfs()
>       libbpf: try to load vmlinux BTF from the kernel first
>       libbpf: patch module BTF obj+type ID into BPF insns
> 
> Lorenzo Bianconi (1):
>       mvneta: add XDP-hints support
> 
> Maryam Tahhan (4):
>       ixgbe: enable xdp-hints
>       ixgbe: add rx timestamp xdp hints support
>       xsk: AF_XDP xdp-hints support in desc options
>       ixgbe: AF_XDP xdp-hints processing in ixgbe_clean_rx_irq_zc
> 
> 
>  drivers/net/ethernet/intel/i40e/i40e.h        |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  22 ++
>  drivers/net/ethernet/intel/i40e/i40e_ptp.c    |  36 ++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 252 ++++++++++++++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   5 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 217 +++++++++++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c  |  82 ++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |   2 +
>  drivers/net/ethernet/marvell/mvneta.c         |  59 +++-
>  include/linux/btf.h                           |   3 +
>  include/linux/netdev_features.h               |   3 +-
>  include/net/xdp.h                             | 256 +++++++++++++++++-
>  include/uapi/linux/bpf.h                      |  35 +++
>  include/uapi/linux/if_xdp.h                   |   2 +-
>  kernel/bpf/btf.c                              |  36 ++-
>  net/core/filter.c                             |  52 ++++
>  net/core/xdp.c                                |  22 +-
>  net/ethtool/common.c                          |   1 +
>  net/xdp/xsk.c                                 |   2 +-
>  net/xdp/xsk_queue.h                           |   3 +-
>  tools/lib/bpf/bpf_core_read.h                 |   3 +-
>  tools/lib/bpf/btf.c                           | 142 +++++++++-
>  tools/lib/bpf/libbpf.c                        |  52 +---
>  tools/lib/bpf/libbpf_internal.h               |   7 +-
>  tools/lib/bpf/relo_core.c                     |   8 +-
>  tools/lib/bpf/relo_core.h                     |   1 +
>  26 files changed, 1127 insertions(+), 177 deletions(-)
> 
> --

Olek
Jesper Dangaard Brouer Sept. 9, 2022, 1:48 p.m. UTC | #2
On 08/09/2022 11.30, Alexander Lobakin wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 07 Sep 2022 17:45:00 +0200
> 
>> This patchset exposes the traditional hardware offload hints to XDP and
>> relies on BTF to expose the layout to users.
>>
[...]
>> The main differences from RFC-v1:
>>   - Drop the idea of a BTF "origin" (vmlinux, module or local)
>>   - Instead use a full 64-bit BTF ID that combines object+type ID
>>
>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>> those.
> 
> Not sure if it's okay to inform the authors about the fact only
> after sending? Esp from the eeeh... "incompatible" implementation?

Just to be clear: I have made sure that the developers of the patches
retain authorship (when applied to git via the From: line) and I've
Cc'ed the developers directly.  I didn't Cc you directly as I knew you
would be included via the XDP-hints list, and I didn't directly use one
of your patches.

> I realize it's open code, but this looks sorta depreciatingly.

After discussions with Larysa on the pre-patchset, I was convinced of the
idea of a full 64-bit BTF ID.  Thus, I took those patches and carried
them in my patchset instead of reimplementing the same myself, precisely
out of respect for Larysa's work, as I wanted to give her credit for
coding this.

I'm very interested in collaborating.  That is why I have picked up
patches from your patchset and am carrying them forward.  I could just
as easily have reimplemented them myself.

--Jesper
Stanislav Fomichev Oct. 3, 2022, 11:55 p.m. UTC | #3
On 09/07, Jesper Dangaard Brouer wrote:
> This patchset exposes the traditional hardware offload hints to XDP and
> relies on BTF to expose the layout to users.

> The main idea is that the kernel and NIC drivers simply define the struct
> layouts they choose to use for XDP-hints. These XDP-hints structs get
> naturally and automatically described via BTF and are implicitly exported
> to users. NIC drivers populate and record their own BTF ID as the last
> member in the XDP metadata area (making it easily accessible to AF_XDP
> userspace at a known negative offset from packet data start).

> A naming convention for the structs (xdp_hints_*) is used such that
> userspace can find and decode the BTF layout and match it against the
> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
> what XDP-hints a driver supports.

> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
> union named "xdp_hints_union" in every driver, which contains all the
> xdp_hints_* structs this driver can support. This makes it easier/quicker
> to find and parse the relevant BTF types.  (Seeking input before fixing
> up all drivers in the patchset.)


> The main differences from RFC-v1:
>   - Drop the idea of a BTF "origin" (vmlinux, module or local)
>   - Instead use a full 64-bit BTF ID that combines object+type ID

> I've taken some of Alexandr/Larysa's libbpf patches and integrated
> those.

> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
> required some refactoring to remove the SKB dependencies.

Hey Jesper,

I took a quick look at the series. Do we really need the enum with the  
flags?
We might eventually hit that "first 16 bits are reserved" issue?

Instead of exposing enum with the flags, why not solve it as follows:
a. We define UAPI struct xdp_rx_hints with _all_ possible hints
b. Each device defines much denser <device>_xdp_rx_hints struct with the
    metadata that it supports
c. The subset of fields in <device>_xdp_rx_hints should match the ones from
    xdp_rx_hints (we essentially standardize on the field names/sizes)
d. We expose <device>_xdp_rx_hints btf id via netlink for each device
e. libbpf will query and do offset relocations for
    xdp_rx_hints -> <device>_xdp_rx_hints at load time

Would that work? Then it seems like we can replace bitfields with the  
following:

   if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
     /* use that hint */
   }

All we need here is for libbpf to, again, do xdp_rx_hints ->
<device>_xdp_rx_hints translation before it evaluates  
bpf_core_field_exists()?

Thoughts? Any downsides? Am I missing something?

Also, about the TX side: I feel like the same can be applied there,
the program works with xdp_tx_hints and libbpf will rewrite to
<device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
those, presumably, can be relocatable by libbpf as well?
Jesper Dangaard Brouer Oct. 4, 2022, 9:29 a.m. UTC | #4
On 04/10/2022 01.55, sdf@google.com wrote:
> On 09/07, Jesper Dangaard Brouer wrote:
>> This patchset exposes the traditional hardware offload hints to XDP and
>> relies on BTF to expose the layout to users.
> 
>> The main idea is that the kernel and NIC drivers simply define the struct
>> layouts they choose to use for XDP-hints. These XDP-hints structs get
>> naturally and automatically described via BTF and are implicitly exported
>> to users. NIC drivers populate and record their own BTF ID as the last
>> member in the XDP metadata area (making it easily accessible to AF_XDP
>> userspace at a known negative offset from packet data start).
> 
>> A naming convention for the structs (xdp_hints_*) is used such that
>> userspace can find and decode the BTF layout and match it against the
>> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
>> what XDP-hints a driver supports.
> 
>> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
>> union named "xdp_hints_union" in every driver, which contains all the
>> xdp_hints_* structs this driver can support. This makes it easier/quicker
>> to find and parse the relevant BTF types.  (Seeking input before fixing
>> up all drivers in the patchset.)
> 
> 
>> The main differences from RFC-v1:
>>   - Drop the idea of a BTF "origin" (vmlinux, module or local)
>>   - Instead use a full 64-bit BTF ID that combines object+type ID
> 
>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>> those.
> 
>> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
>> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
>> required some refactoring to remove the SKB dependencies.
> 
> Hey Jesper,
> 
> I took a quick look at the series. 
Appreciate that! :-)

> Do we really need the enum with the flags?

The primary reason for using an enum is that these get exposed as BTF.
The proposal is that userspace needs to obtain the flags via BTF, such
that they don't become UAPI, but something we can change later.
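
As an illustration of that point (flag names invented for this sketch):
named enumerators are carried in the module's BTF, so consumers can
resolve the values at runtime instead of hard-coding them:

   enum xdp_hints_flags {
           HINT_FLAG_RX_HASH       = 1U << 0,
           HINT_FLAG_VLAN_PRESENT  = 1U << 1,
           HINT_FLAG_RX_TIMESTAMP  = 1U << 2,
   };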

> We might eventually hit that "first 16 bits are reserved" issue?
> 
> Instead of exposing enum with the flags, why not solve it as follows:
> a. We define UAPI struct xdp_rx_hints with _all_ possible hints

How can we know _all_ possible hints from the beginning?

A UAPI + central struct dictating all possible hints will limit innovation.

> b. Each device defines much denser <device>_xdp_rx_hints struct with the
>     metadata that it supports

Thus, the NIC device is limited to what is defined in UAPI struct
xdp_rx_hints.  Again, this limits innovation.

> c. The subset of fields in <device>_xdp_rx_hints should match the ones from
>     xdp_rx_hints (we essentially standardize on the field names/sizes)
> d. We expose <device>_xdp_rx_hints btf id via netlink for each device

For this proposed design you would still need more than one BTF ID or
<device>_xdp_rx_hints struct, because not all packets contain all
hints. The most common case is HW timestamping, which some HW only
supports for PTP frames.

Plus, I don't see a need to expose anything via netlink, as we can just
use the existing BTF information from the module, thus avoiding
creating more UAPI.

> e. libbpf will query and do offset relocations for
>     xdp_rx_hints -> <device>_xdp_rx_hints at load time
> 
> Would that work? Then it seems like we can replace bitfields with the 

I used to be a fan of bitfields, until I discovered that they are bad
for performance, because compilers cannot optimize these.

> following:
> 
>    if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
>      /* use that hint */

Fairly often a VLAN will not be set in packets, so we still have to read
and check a bitfield/flag to see if the VLAN value is valid. (Guess it is
implicit in the above code.)

>    }
> 
> All we need here is for libbpf to, again, do xdp_rx_hints ->
> <device>_xdp_rx_hints translation before it evaluates 
> bpf_core_field_exists()?
> 
> Thoughts? Any downsides? Am I missing something?
> 

Well, the downside is primarily that this design limits innovation.

Each time a NIC driver wants to introduce a new hardware hint, it has
to update the central UAPI xdp_rx_hints struct first.

The design in the patchset is open for innovation.  Drivers can extend
their own xdp_hints_<driver>_xxx struct(s).  They still have to land
their patches upstream, but avoid mangling a central UAPI struct. As
upstream we review driver changes and should focus on sane struct member
naming(+size), especially if this "sounds" like a hint/feature that more
drivers are likely to support.  With help from BTF relocations, a new
driver can support the same hint/feature if the naming(+size) matches
(without necessarily having the same offset in the struct).
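
For example (driver and member names invented here), a program compiled
against one definition keeps working when a module lays the members out
differently, because the field offset is relocated at load time:

   struct xdp_hints_foo {
           __u32 rx_hash;
           __u64 rx_timestamp;
   } __attribute__((preserve_access_index));

   /* The offset of rx_timestamp is fixed up against the kernel/module
    * BTF at load time, so member order in the module may differ.
    */
   __u64 ts = ((struct xdp_hints_foo *)meta)->rx_timestamp;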

> Also, about the TX side: I feel like the same can be applied there,
> the program works with xdp_tx_hints and libbpf will rewrite to
> <device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
> those, presumably, can be relocatable by libbpf as well?
> 

Good to think ahead for the TX-side, even though I think we should focus
on landing the RX-side first.

I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
common struct xdp_hints_common, without a RX/TX direction indication.
Maybe this is wrong of me, but my thinking was that most of the common
hints can be directly used as TX-side hints.  I'm hoping TX-side
xdp-hints will need little-to-no adjustment before using the
hints as TX "instructions".  I'm hoping that XDP-redirect will just work
and the xmit driver can use the XDP-hints area.

Please correct me if I'm wrong.
The checksum fields hopefully translate to similar TX offload "actions".
The VLAN offload hint should translate directly to the TX-side.

I can easily be convinced we should name it xdp_hints_rx_common from the
start, but then I will propose that xdp_hints_tx_common have the
checksum and VLAN fields+flags at the same locations, such that we don't
take any performance hit for moving them to "TX-side" hints, making
XDP-redirect just work.

--Jesper
Stanislav Fomichev Oct. 4, 2022, 6:26 p.m. UTC | #5
On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 04/10/2022 01.55, sdf@google.com wrote:
> > On 09/07, Jesper Dangaard Brouer wrote:
> >> This patchset exposes the traditional hardware offload hints to XDP and
> >> relies on BTF to expose the layout to users.
> >
> >> The main idea is that the kernel and NIC drivers simply define the struct
> >> layouts they choose to use for XDP-hints. These XDP-hints structs get
> >> naturally and automatically described via BTF and are implicitly exported
> >> to users. NIC drivers populate and record their own BTF ID as the last
> >> member in the XDP metadata area (making it easily accessible to AF_XDP
> >> userspace at a known negative offset from packet data start).
> >
> >> A naming convention for the structs (xdp_hints_*) is used such that
> >> userspace can find and decode the BTF layout and match it against the
> >> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
> >> what XDP-hints a driver supports.
> >
> >> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
> >> union named "xdp_hints_union" in every driver, which contains all the
> >> xdp_hints_* structs this driver can support. This makes it easier/quicker
> >> to find and parse the relevant BTF types.  (Seeking input before fixing
> >> up all drivers in the patchset.)
> >
> >
> >> The main differences from RFC-v1:
> >>   - Drop the idea of a BTF "origin" (vmlinux, module or local)
> >>   - Instead use a full 64-bit BTF ID that combines object+type ID
> >
> >> I've taken some of Alexandr/Larysa's libbpf patches and integrated
> >> those.
> >
> >> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
> >> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
> >> required some refactoring to remove the SKB dependencies.
> >
> > Hey Jesper,
> >
> > I took a quick look at the series.
> Appreciate that! :-)
>
> > Do we really need the enum with the flags?
>
> The primary reason for using enum is that these gets exposed as BTF.
> The proposal is that userspace/BTF need to obtain the flags via BTF,
> such that they don't become UAPI, but something we can change later.
>
> > We might eventually hit that "first 16 bits are reserved" issue?
> >
> > Instead of exposing enum with the flags, why not solve it as follows:
> > a. We define UAPI struct xdp_rx_hints with _all_ possible hints
>
> How can we know _all_ possible hints from the beginning(?).
>
> UAPI + central struct dictating all possible hints, will limit innovation.

We don't need to know them all in advance, the same way we don't know
them all for the flags enum. That UAPI xdp_rx_hints can be extended any
time some driver needs a new hint offload. The benefit here is that
we have a "common registry" of all offloads, and different drivers have
an opportunity to share.

Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake
uapi struct (__sk_buff) and the access to it gets translated into
<device>_xdp_rx_hints offsets (sk_buff).
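
A rough sketch of that idea (all type names hypothetical):

   struct xdp_rx_hints {            /* stable, user-facing view   */
           __u64 rx_timestamp;
           __u16 vlan_tci;
   };

   struct ixgbe_xdp_rx_hints {      /* dense, per-device layout   */
           __u16 vlan_tci;
           __u64 rx_timestamp;
   };

The program only ever references struct xdp_rx_hints; at load time libbpf
rewrites each field access into the matching ixgbe_xdp_rx_hints offset,
just like __sk_buff accesses are rewritten into sk_buff offsets.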

> > b. Each device defines much denser <device>_xdp_rx_hints struct with the
> >     metadata that it supports
>
> Thus, the NIC device is limited to what is defined in UAPI struct
> xdp_rx_hints.  Again this limits innovation.

I guess what I'm missing from your series is the bpf/userspace side.
Do you have an example on the bpf side that will work for, say,
xdp_hints_ixgbe_timestamp?

Suppose, you pass this custom hints btf_id via xdp_md as proposed,
what's the action on the bpf side to consume this?

If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
populated at runtime by libbpf? */) {
  // do something with rx_timestamp
  // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
} else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
  // do something else
  // plus explicitly handle xdp_hints_common here?
} else {
  // handle xdp_hints_common
}

What I'd like to avoid is an xdp program targeting specific drivers.
Where possible, we should aim towards something like "if this device
has rx_timestamp offload -> use it without depending too much on
specific btf_ids.

> > c. The subset of fields in <device>_xdp_rx_hints should match the ones from
> >     xdp_rx_hints (we essentially standardize on the field names/sizes)
> > d. We expose <device>_xdp_rx_hints btf id via netlink for each device
>
> For this proposed design you would still need more than one BTF ID or
> <device>_xdp_rx_hints struct's, because not all packets contains all
> hints. The most common case is HW timestamping, which some HW only
> supports for PTP frames.
>
> Plus, I don't see a need to expose anything via netlink, as we can just
> use the existing BTF information from the module.  Thus, avoiding to
> creating more UAPI.

See above. I think even with your series, that btf_id info should also
come via netlink so the programs can query it before loading and do
the required adjustments. Otherwise, I'm not sure I understand what I
need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
late? I need to know them in advance to at least populate those ids
into the bpf program itself?

> > e. libbpf will query and do offset relocations for
> >     xdp_rx_hints -> <device>_xdp_rx_hints at load time
> >
> > Would that work? Then it seems like we can replace bitfields with the
>
> I used to be a fan of bitfields, until I discovered that they are bad
> for performance, because compilers cannot optimize these.

Ack, good point, something to keep in mind.

> > following:
> >
> >    if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
> >      /* use that hint */
>
> Fairly often a VLAN will not be set in packets, so we still have to read
> and check a bitfield/flag if the VLAN value is valid. (Guess it is
> implicit in above code).

That's a fair point. Then we need two signals?

1. Whether this particular offload is supported for the device at all
(via that bpf_core_field_exists or something similar)
2. Whether this particular packet has particular metadata (via your
proposed flags)

if (device I'm attaching xdp to has vlan offload) {   // via bpf_core_field_exists?
  if (particular packet comes with a vlan tag) {      // via your proposed bitfield flags?
  }
}

Or are we assuming that (2) is fast enough and we don't care about
(1)? Because (1) can 'if (0)' the whole branch and make the verifier
remove that part.

> >    }
> >
> > All we need here is for libbpf to, again, do xdp_rx_hints ->
> > <device>_xdp_rx_hints translation before it evaluates
> > bpf_core_field_exists()?
> >
> > Thoughts? Any downsides? Am I missing something?
> >
>
> Well, the downside is primarily that this design limits innovation.
>
> Each time a NIC driver want to introduce a new hardware hint, they have
> to update the central UAPI xdp_rx_hints struct first.
>
> The design in the patchset is to open for innovation.  Driver can extend
> their own xdp_hints_<driver>_xxx struct(s).  They still have to land
> their patches upstream, but avoid mangling a central UAPI struct. As
> upstream we review driver changes and should focus on sane struct member
> naming(+size) especially if this "sounds" like a hint/feature that more
> driver are likely to support.  With help from BTF relocations, a new
> driver can support same hint/feature if naming(+size) match (without
> necessary the same offset in the struct).

The opposite side of this approach is that we'll have 'ixgbe_hints'
with 'rx_timestamp' and 'mvneta_hints' with something like
'rx_tstamp'.

> > Also, about the TX side: I feel like the same can be applied there,
> > the program works with xdp_tx_hints and libbpf will rewrite to
> > <device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
> > those, presumably, can be relocatable by libbpf as well?
> >
>
> Good to think ahead for TX-side, even-though I think we should focus on
> landing RX-side first.
>
> I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
> common struct xdp_hints_common, without a RX/TX direction indication.
> Maybe this is wrong of me, but my thinking was that most of the common
> hints can be directly used as TX-side hints.  I'm hoping TX-side
> xdp-hints will need to do little-to-non adjustment, before using the
> hints as TX "instruction".  I'm hoping that XDP-redirect will just work
> and xmit driver can use XDP-hints area.
>
> Please correct me if I'm wrong.
> The checksum fields hopefully translates to similar TX offload "actions".
> The VLAN offload hint should translate directly to TX-side.
>
> I can easily be convinced we should name it xdp_hints_rx_common from the
> start, but then I will propose that xdp_hints_tx_common have the
> checksum and VLAN fields+flags at same locations, such that we don't
> take any performance hint for moving them to "TX-side" hints, making
> XDP-redirect just work.

Might be good to think about this beforehand. I agree that most of the
layout should hopefully match. However, one case that I'm interested
in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
in the metadata; for tx, I'm merely setting a flag somewhere to
request it for async delivery later (I hope we plan to support that
for af_xdp?). So the layout might be completely different :-(

Martin KaFai Lau Oct. 5, 2022, 12:25 a.m. UTC | #6
On 10/4/22 11:26 AM, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 04/10/2022 01.55, sdf@google.com wrote:
>>> On 09/07, Jesper Dangaard Brouer wrote:
>>>> This patchset exposes the traditional hardware offload hints to XDP and
>>>> relies on BTF to expose the layout to users.
>>>
>>>> The main idea is that the kernel and NIC drivers simply define the struct
>>>> layouts they choose to use for XDP-hints. These XDP-hints structs get
>>>> naturally and automatically described via BTF and are implicitly exported
>>>> to users. NIC drivers populate and record their own BTF ID as the last
>>>> member in the XDP metadata area (making it easily accessible to AF_XDP
>>>> userspace at a known negative offset from packet data start).
>>>
>>>> A naming convention for the structs (xdp_hints_*) is used such that
>>>> userspace can find and decode the BTF layout and match it against the
>>>> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
>>>> what XDP-hints a driver supports.
>>>
>>>> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
>>>> union named "xdp_hints_union" in every driver, which contains all the
>>>> xdp_hints_* structs this driver can support. This makes it easier/quicker
>>>> to find and parse the relevant BTF types.  (Seeking input before fixing
>>>> up all drivers in the patchset.)
>>>
>>>
>>>> The main differences from RFC-v1:
>>>>    - Drop the idea of a BTF "origin" (vmlinux, module or local)
>>>>    - Instead use a full 64-bit BTF ID that combines object+type ID
>>>
>>>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>>>> those.
>>>
>>>> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
>>>> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
>>>> required some refactoring to remove the SKB dependencies.
>>>
>>> Hey Jesper,
>>>
>>> I took a quick look at the series.
>> Appreciate that! :-)
>>
>>> Do we really need the enum with the flags?
>>
>> The primary reason for using enum is that these gets exposed as BTF.
>> The proposal is that userspace/BTF need to obtain the flags via BTF,
>> such that they don't become UAPI, but something we can change later.
>>
>>> We might eventually hit that "first 16 bits are reserved" issue?
>>>
>>> Instead of exposing enum with the flags, why not solve it as follows:
>>> a. We define UAPI struct xdp_rx_hints with _all_ possible hints
>>
>> How can we know _all_ possible hints from the beginning(?).
>>
>> UAPI + central struct dictating all possible hints, will limit innovation.
> 
> We don't need to know them all in advance. The same way we don't know
> them all for flags enum. That UAPI xdp_rx_hints can be extended any
> time some driver needs some new hint offload. The benefit here is that
> we have a "common registry" of all offloads and different drivers have
> an opportunity to share.
> 
> Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake
> uapi struct (__sk_buff) and the access to it gets translated into
> <device>_xdp_rx_hints offsets (sk_buff).
> 
>>> b. Each device defines much denser <device>_xdp_rx_hints struct with the
>>>      metadata that it supports
>>
>> Thus, the NIC device is limited to what is defined in UAPI struct
>> xdp_rx_hints.  Again this limits innovation.
> 
> I guess what I'm missing from your series is the bpf/userspace side.
> Do you have an example on the bpf side that will work for, say,
> xdp_hints_ixgbe_timestamp?

+1.  A selftest is useful.

> 
> Suppose, you pass this custom hints btf_id via xdp_md as proposed,
> what's the action on the bpf side to consume this?
> 
> If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
> populated at runtime by libbpf? */) {
>    // do something with rx_timestamp
>    // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
> } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
>    // do something else
>    // plus explicitly handle xdp_hints_common here?
> } else {
>    // handle xdp_hints_common
> }
> 
> What I'd like to avoid is an xdp program targeting specific drivers.
> Where possible, we should aim towards something like "if this device
> has rx_timestamp offload -> use it without depending too much on
> specific btf_ids.

It would also be my preference if it can avoid btf_id comparisons against a
specific driver like the above and let libbpf CO-RE handle the
matching/relocation.  For rx hwtimestamp, the value could be just 0 if a
specific hw/driver cannot provide it for all packets while some other hw can.

An intentionally wild question: what does it take for the driver to return
the hints?  Is the rx_desc and rx_queue enough?  When the xdp prog is calling
a kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver
replace it with some inline bpf code (like how the inline code is generated
for the map_lookup helper)?  The xdp prog can then store the hwtstamp in the
meta area in any layout it wants.
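
A sketch of what the prog side of that could look like
(bpf_xdp_get_hwtstamp() is hypothetical here; the point is that the driver
would inline the descriptor access when the prog is loaded for that device):

   #include <linux/bpf.h>
   #include <bpf/bpf_helpers.h>

   extern __u64 bpf_xdp_get_hwtstamp(struct xdp_md *ctx) __ksym;

   SEC("xdp")
   int rx_prog(struct xdp_md *ctx)
   {
           __u64 *meta;

           /* reserve 8 bytes of metadata in front of the packet */
           if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u64)))
                   return XDP_PASS;

           meta = (void *)(long)ctx->data_meta;
           if ((void *)(meta + 1) > (void *)(long)ctx->data)
                   return XDP_PASS;   /* verifier bounds check */

           *meta = bpf_xdp_get_hwtstamp(ctx);  /* prog-chosen layout */
           return XDP_PASS;
   }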
Jakub Kicinski Oct. 5, 2022, 12:59 a.m. UTC | #7
On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> A intentionally wild question, what does it take for the driver to return the 
> hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is calling a 
> kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver 
> replace it with some inline bpf code (like how the inline code is generated for 
> the map_lookup helper).  The xdp prog can then store the hwstamp in the meta 
> area in any layout it wants.

Since you mentioned it... FWIW that was always my preference rather than
the BTF magic :)  The jited image would have to be per-driver like we
do for BPF offload but that's easy to do from the technical
perspective (I doubt many deployments bind the same prog to multiple
HW devices).
Stanislav Fomichev Oct. 5, 2022, 1:02 a.m. UTC | #8
On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> > A intentionally wild question, what does it take for the driver to return the
> > hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is calling a
> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver
> > replace it with some inline bpf code (like how the inline code is generated for
> > the map_lookup helper).  The xdp prog can then store the hwstamp in the meta
> > area in any layout it wants.
>
> Since you mentioned it... FWIW that was always my preference rather than
> the BTF magic :)  The jited image would have to be per-driver like we
> do for BPF offload but that's easy to do from the technical
> perspective (I doubt many deployments bind the same prog to multiple
> HW devices)..

+1, sounds like a good alternative (got your reply while typing)
I'm not too versed in the rx_desc/rx_queue area, but seems like worst
case that bpf_xdp_get_hwtstamp can probably receive an xdp_md ctx and
parse it out from the pre-populated metadata?

Btw, do we also need to think about the redirect case? What happens
when I redirect one frame from a device A with one metadata format to
a device B with another?
Jakub Kicinski Oct. 5, 2022, 1:24 a.m. UTC | #9
On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
> +1, sounds like a good alternative (got your reply while typing)
> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> parse it out from the pre-populated metadata?

I'd think so, worst case the driver can put xdp_md into a struct 
and container_of() to get to its own stack with whatever fields
it needs.

> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

If there is a program on Tx then it'd be trivial - just do the 
info <-> descriptor translation in the opposite direction than Rx.
TBH I'm not sure how it'd be done in the current approach, either.
Now that I've questioned the BTF way and mentioned the Tx-side program
in a single thread, I'd better stop talking...
Stanislav Fomichev Oct. 5, 2022, 2:15 a.m. UTC | #10
On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
> > +1, sounds like a good alternative (got your reply while typing)
> > I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> > parse it out from the pre-populated metadata?
>
> I'd think so, worst case the driver can put xdp_md into a struct
> and container_of() to get to its own stack with whatever fields
> it needs.

Ack, seems like something worth exploring then.

The only issue I see with that is that we'd probably have to extend
the loading api to pass the target xdp device so we can pre-generate
per-device bytecode for those kfuncs? And this potentially will block
attaching the same program to different drivers/devices?
Or, Martin, did you maybe have something better in mind?

> > Btw, do we also need to think about the redirect case? What happens
> > when I redirect one frame from a device A with one metadata format to
> > a device B with another?
>
> If there is a program on Tx then it'd be trivial - just do the
> info <-> descriptor translation in the opposite direction than Rx.
> TBH I'm not sure how it'd be done in the current approach, either.

Yeah, I don't think it magically works in any case. I'm just trying to
understand whether it's something we care to support out of the box or
can punt to the bpf programs themselves and say "if you care about
forwarding metadata, somehow agree on the format yourself".

> Now I questioned the BTF way and mentioned the Tx-side program in
> a single thread, I better stop talking...

Forget about btf, hail to the new king - kfunc :-D
Toke Høiland-Jørgensen Oct. 5, 2022, 10:06 a.m. UTC | #11
Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>> > A intentionally wild question, what does it take for the driver to return the
>> > hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is calling a
>> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver
>> > replace it with some inline bpf code (like how the inline code is generated for
>> > the map_lookup helper).  The xdp prog can then store the hwstamp in the meta
>> > area in any layout it wants.
>>
>> Since you mentioned it... FWIW that was always my preference rather than
>> the BTF magic :)  The jited image would have to be per-driver like we
>> do for BPF offload but that's easy to do from the technical
>> perspective (I doubt many deployments bind the same prog to multiple
>> HW devices)..
>
> +1, sounds like a good alternative (got your reply while typing)
> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> parse it out from the pre-populated metadata?
>
> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

Yes, we absolutely do! In fact, to me this (redirects) is the main
reason why we need the ID in the packet in the first place: when running
on (say) a veth, an XDP program needs to be able to deal with packets
from multiple physical NICs.

As far as API is concerned, my hope was that we could solve this with a
CO-RE-like approach where the program author just writes something like:

hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);

and bpf_get_xdp_hint() is really a macro (or a special kind of
relocation?) and libbpf would do the following on load:

- query the kernel BTF for all possible xdp_hint structs
- figure out which of them have an 'u64 hw_tstamp' member
- generate the necessary conditionals / jump table to disambiguate on
  the BTF_ID in the packet


Now, if this is better done by a kfunc I'm not terribly opposed to that
either, but I'm not sure it's actually better/easier to do in the kernel
than in libbpf at load time?
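
For illustration, the dispatch libbpf would have to emit could look roughly
like this (the BTF_ID_* constants would be patched in by libbpf; struct
names follow this patchset's convention but are otherwise invented, and
verifier bounds checks are omitted):

   void *data = (void *)(long)ctx->data;
   __u64 btf_id = *(__u64 *)(data - sizeof(__u64));
   __u64 hw_tstamp = 0;

   if (btf_id == BTF_ID_XDP_HINTS_I40E_TIMESTAMP) {
           struct xdp_hints_i40e_timestamp *h = data - sizeof(*h);
           hw_tstamp = h->rx_timestamp;
   } else if (btf_id == BTF_ID_XDP_HINTS_IXGBE_TIMESTAMP) {
           struct xdp_hints_ixgbe_timestamp *h = data - sizeof(*h);
           hw_tstamp = h->rx_timestamp;
   }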

-Toke
Burakov, Anatoly Oct. 5, 2022, 1:14 p.m. UTC | #12
On 04-Oct-22 10:29 AM, Jesper Dangaard Brouer wrote:
> 
> On 04/10/2022 01.55, sdf@google.com wrote:
>> On 09/07, Jesper Dangaard Brouer wrote:
>>> This patchset exposes the traditional hardware offload hints to XDP and
>>> relies on BTF to expose the layout to users.
>>
>>> The main idea is that the kernel and NIC drivers simply define the struct
>>> layouts they choose to use for XDP-hints. These XDP-hints structs get
>>> naturally and automatically described via BTF and are implicitly exported
>>> to users. NIC drivers populate and record their own BTF ID as the last
>>> member in the XDP metadata area (making it easily accessible to AF_XDP
>>> userspace at a known negative offset from packet data start).
>>
>>> A naming convention for the structs (xdp_hints_*) is used such that
>>> userspace can find and decode the BTF layout and match it against the
>>> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
>>> what XDP-hints a driver supports.
>>
>>> The patch "i40e: Add xdp_hints_union" introduces the idea of creating a
>>> union named "xdp_hints_union" in every driver, which contains all the
>>> xdp_hints_* structs this driver can support. This makes it easier/quicker
>>> to find and parse the relevant BTF types.  (Seeking input before fixing
>>> up all drivers in the patchset.)
>>
>>
>>> The main differences from RFC-v1:
>>>   - Drop the idea of a BTF "origin" (vmlinux, module or local)
>>>   - Instead use a full 64-bit BTF ID that combines object+type ID
>>
>>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>>> those.
>>
>>> The patchset exceeds netdev's usual max-15-patches rule. My excuse is that
>>> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support, which
>>> required some refactoring to remove the SKB dependencies.
>>
>> Hey Jesper,
>>
>> I took a quick look at the series. 
> Appreciate that! :-)
> 
>> Do we really need the enum with the flags?
> 
> The primary reason for using enum is that these gets exposed as BTF.
> The proposal is that userspace/BTF need to obtain the flags via BTF,
> such that they don't become UAPI, but something we can change later.
> 
>> We might eventually hit that "first 16 bits are reserved" issue?
>>
>> Instead of exposing enum with the flags, why not solve it as follows:
>> a. We define UAPI struct xdp_rx_hints with _all_ possible hints
> 
> How can we know _all_ possible hints from the beginning(?).
> 
> UAPI + central struct dictating all possible hints, will limit innovation.
> 
>> b. Each device defines much denser <device>_xdp_rx_hints struct with the
>>     metadata that it supports
> 
> Thus, the NIC device is limited to what is defined in UAPI struct
> xdp_rx_hints.  Again this limits innovation.
> 
>> c. The subset of fields in <device>_xdp_rx_hints should match the ones 
>> from
>>     xdp_rx_hints (we essentially standardize on the field names/sizes)
>> d. We expose <device>_xdp_rx_hints btf id via netlink for each device
> 
> For this proposed design you would still need more than one BTF ID or
> <device>_xdp_rx_hints struct's, because not all packets contains all
> hints. The most common case is HW timestamping, which some HW only
> supports for PTP frames.
> 
> Plus, I don't see a need to expose anything via netlink, as we can just
> use the existing BTF information from the module.  Thus, avoiding to
> creating more UAPI.
> 
>> e. libbpf will query and do offset relocations for
>>     xdp_rx_hints -> <device>_xdp_rx_hints at load time
>>
>> Would that work? Then it seems like we can replace bitfields with the 
> 
> I used to be a fan of bitfields, until I discovered that they are bad
> for performance, because compilers cannot optimize these.
> 
>> following:
>>
>>    if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
>>      /* use that hint */
> 
> Fairly often a VLAN will not be set in packets, so we still have to read
> and check a bitfield/flag if the VLAN value is valid. (Guess it is
> implicit in above code).
> 
>>    }
>>
>> All we need here is for libbpf to, again, do xdp_rx_hints ->
>> <device>_xdp_rx_hints translation before it evaluates 
>> bpf_core_field_exists()?
>>
>> Thoughts? Any downsides? Am I missing something?
>>
> 
> Well, the downside is primarily that this design limits innovation.
> 
> Each time a NIC driver want to introduce a new hardware hint, they have
> to update the central UAPI xdp_rx_hints struct first.
> 
> The design in the patchset is to open for innovation.  Driver can extend
> their own xdp_hints_<driver>_xxx struct(s).  They still have to land
> their patches upstream, but avoid mangling a central UAPI struct. As
> upstream we review driver changes and should focus on sane struct member
> naming(+size) especially if this "sounds" like a hint/feature that more
> driver are likely to support.  With help from BTF relocations, a new
> driver can support same hint/feature if naming(+size) match (without
> necessary the same offset in the struct).
> 
>> Also, about the TX side: I feel like the same can be applied there,
>> the program works with xdp_tx_hints and libbpf will rewrite to
>> <device>_xdp_tx_hints. xdp_tx_hints might have fields like 
>> "has_tx_vlan:1";
>> those, presumably, can be relocatable by libbpf as well?
>>
> 
> Good to think ahead for TX-side, even-though I think we should focus on
> landing RX-side first.
> 
> I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
> common struct xdp_hints_common, without a RX/TX direction indication.
> Maybe this is wrong of me, but my thinking was that most of the common
> hints can be directly used as TX-side hints.  I'm hoping TX-side
> xdp-hints will need to do little-to-non adjustment, before using the
> hints as TX "instruction".  I'm hoping that XDP-redirect will just work
> and xmit driver can use XDP-hints area.
> 
> Please correct me if I'm wrong.
> The checksum fields hopefully translates to similar TX offload "actions".
> The VLAN offload hint should translate directly to TX-side.

Like I indicated in another response, not necessarily. Rx checksum
typically indicates that the checksumming was completed and whether the
checksum was good/bad, but for Tx we actually supply offsets in the
metadata (possibly multiple ones, depending on the L2/L3/L4 packet; plus
there's also a need to distinguish between packet types, as different NICs
will have different offload bits for different ptypes). So, while VLAN
offload may or may not translate directly to the Tx side of things,
checksumming probably won't.
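
Illustratively (a hypothetical struct, mirroring the skb csum_start /
csum_offset model), a Tx checksum hint is an instruction rather than a
result:

   struct xdp_hints_tx_csum {
           __u16 csum_start;    /* offset where summing starts        */
           __u16 csum_offset;   /* where HW writes the 16-bit result  */
   };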

> 
> I can easily be convinced we should name it xdp_hints_rx_common from the
> start, but then I will propose that xdp_hints_tx_common have the
> checksum and VLAN fields+flags at same locations, such that we don't
> take any performance hint for moving them to "TX-side" hints, making
> XDP-redirect just work.
> 
> --Jesper
>
Jesper Dangaard Brouer Oct. 5, 2022, 1:43 p.m. UTC | #13
On 05/10/2022 02.25, Martin KaFai Lau wrote:
> For rx hwtimestamp, the value could be just 0 if a specific hw/driver 
> cannot provide it for all packets while some other hw can.

Keep in mind that we want to avoid having to write a (64-bit) zero into
the metadata for rx_hwtimestamp for every packet that doesn't carry a
timestamp.  That essentially reverts back to clearing memory like with
SKBs; due to the performance overhead we don't want to go down that path
again!

There are multiple ways to avoid having to zero-init the memory.

In this patchset I have chosen the traditional approach of a flags (u32)
field located in xdp_hints_common, e.g. setting a flag if a field is
valid (p.s. John Fastabend convinced me of this approach ;-)).
But COMBINED with: some BTF ID layouts don't contain some fields, e.g.
the rx_timestamp, thus the code has no reason to query those flag fields.
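
A sketch of that combination (flag and member names illustrative):

   struct xdp_hints_i40e_timestamp *h = meta;

   /* The field is read only when its validity flag is set, so packets
    * without a HW timestamp never pay for zeroing the field; layouts
    * that lack the field are never queried for it at all.
    */
   if (h->common.xdp_hints_flags & HINT_FLAG_RX_TIMESTAMP)
           tstamp = h->rx_timestamp;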


I am intrigued to find a way to leverage bpf_core_field_exists() some
more (as proposed by Stanislav).  (As this can allow for dead-code
elimination).

--Jesper
Jesper Dangaard Brouer Oct. 5, 2022, 2:19 p.m. UTC | #14
On 05/10/2022 03.02, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>>> A intentionally wild question, what does it take for the driver to return the
>>> hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is calling a
>>> kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver
>>> replace it with some inline bpf code (like how the inline code is generated for
>>> the map_lookup helper).  The xdp prog can then store the hwstamp in the meta
>>> area in any layout it wants.
>>
>> Since you mentioned it... FWIW that was always my preference rather than
>> the BTF magic :)  The jited image would have to be per-driver like we
>> do for BPF offload but that's easy to do from the technical
>> perspective (I doubt many deployments bind the same prog to multiple
>> HW devices)..

On the technical side we do have the ifindex that can be passed along
which is currently used for getting XDP hardware offloading to work.
But last time I tried this, I failed due to BPF tail call maps.
(It's not going to fly for other reasons, see redirect below).

> 
> +1, sounds like a good alternative (got your reply while typing)
> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> parse it out from the pre-populated metadata?
> 
> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

Exactly the problem.  With XDP redirect the "remote" target device also
needs to interpret this metadata layout.  For the RX-side we have the
immediate case of redirecting into a veth device.   For the future
TX-side this is likely the same kind of issue, but I hope that if we can
solve this for the veth redirect use-case, it will keep us future-proof.

For the veth use-case I hope that we can use the same trick as
bpf_core_field_exists() to do dead-code elimination based on whether a
device driver is loaded on the system, like this pseudo code:


  if (bpf_core_type_id_kernel(struct xdp_hints_i40e_timestamp)) {
    /* check id + extract timestamp */
  }
  if (bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)) {
    /* check id + extract timestamp */
  }

If the given device driver doesn't exist on the system, I assume
bpf_core_type_id_kernel() will return 0 at libbpf relocation/load-time,
and thus this should cause dead-code elimination.  Should work today AFAIK?
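
Fleshed out as a compilable sketch (the struct layout here is an
assumption, and the comparison relies on this patchset's libbpf change
that patches the full 64-bit object+type ID into the relocation):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>

  struct xdp_hints_i40e_timestamp {
          __u64 rx_timestamp;
          __u64 btf_full_id;      /* last member, before data start */
  } __attribute__((preserve_access_index));

  SEC("xdp")
  int read_rx_tstamp(struct xdp_md *ctx)
  {
          void *data = (void *)(long)ctx->data;
          void *meta = (void *)(long)ctx->data_meta;
          struct xdp_hints_i40e_timestamp *h = meta;
          __u64 ts = 0;

          if (meta + sizeof(*h) > data)   /* bounds check for verifier */
                  return XDP_PASS;

          /* If i40e isn't loaded, bpf_core_type_id_kernel() becomes a
           * load-time constant 0 and this branch is eliminated.
           */
          if (h->btf_full_id ==
              bpf_core_type_id_kernel(struct xdp_hints_i40e_timestamp))
                  ts = h->rx_timestamp;

          if (ts)
                  bpf_printk("rx_timestamp=%llu", ts);
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";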

--Jesper
Jesper Dangaard Brouer Oct. 5, 2022, 4:29 p.m. UTC | #15
On 04/10/2022 20.26, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 04/10/2022 01.55, sdf@google.com wrote:
>>> On 09/07, Jesper Dangaard Brouer wrote:
>>>> This patchset expose the traditional hardware offload hints to XDP and
>>>> rely on BTF to expose the layout to users.
>>>
>>>> Main idea is that the kernel and NIC drivers simply defines the struct
>>>> layouts they choose to use for XDP-hints. These XDP-hints structs gets
>>>> naturally and automatically described via BTF and implicitly exported to
>>>> users. NIC drivers populate and records their own BTF ID as the last
>>>> member in XDP metadata area (making it easily accessible by AF_XDP
>>>> userspace at a known negative offset from packet data start).
>>>
>>>> Naming conventions for the structs (xdp_hints_*) is used such that
>>>> userspace can find and decode the BTF layout and match against the
>>>> provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
>>>> what XDP-hints a driver supports.
>>>
>>>> The patch "i40e: Add xdp_hints_union" introduce the idea of creating a
>>>> union named "xdp_hints_union" in every driver, which contains all
>>>> xdp_hints_* struct this driver can support. This makes it easier/quicker
>>>> to find and parse the relevant BTF types.  (Seeking input before fixing
>>>> up all drivers in patchset).
>>>
>>>
>>>> The main different from RFC-v1:
>>>>    - Drop idea of BTF "origin" (vmlinux, module or local)
>>>>    - Instead to use full 64-bit BTF ID that combine object+type ID
>>>
>>>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>>>> those.
>>>
>>>> Patchset exceeds netdev usually max 15 patches rule. My excuse is three
>>>> NIC drivers (i40e, ixgbe and mvneta) gets XDP-hints support and which
>>>> required some refactoring to remove the SKB dependencies.
>>>
>>> Hey Jesper,
>>>
>>> I took a quick look at the series.
>> Appreciate that! :-)
>>
>>> Do we really need the enum with the flags?
>>
>> The primary reason for using enum is that these gets exposed as BTF.
>> The proposal is that userspace/BTF need to obtain the flags via BTF,
>> such that they don't become UAPI, but something we can change later.
>>
>>> We might eventually hit that "first 16 bits are reserved" issue?
>>>
>>> Instead of exposing enum with the flags, why not solve it as follows:
>>> a. We define UAPI struct xdp_rx_hints with _all_ possible hints
>>
>> How can we know _all_ possible hints from the beginning(?).
>>
>> UAPI + central struct dictating all possible hints, will limit innovation.
> 
> We don't need to know them all in advance. The same way we don't know
> them all for flags enum. That UAPI xdp_rx_hints can be extended any
> time some driver needs some new hint offload. The benefit here is that
> we have a "common registry" of all offloads and different drivers have
> an opportunity to share.
> 
> Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake
> uapi struct (__sk_buff) and the access to it gets translated into
> <device>_xdp_rx_hints offsets (sk_buff).
> 
>>> b. Each device defines much denser <device>_xdp_rx_hints struct with the
>>>      metadata that it supports
>>
>> Thus, the NIC device is limited to what is defined in UAPI struct
>> xdp_rx_hints.  Again this limits innovation.
> 
> I guess what I'm missing from your series is the bpf/userspace side.
> Do you have an example on the bpf side that will work for, say,
> xdp_hints_ixgbe_timestamp?
> 
> Suppose, you pass this custom hints btf_id via xdp_md as proposed,

I just want to reiterate why we place btf_full_id at the "end inline".
This makes it easily available for AF_XDP to consume.  Plus, we already
have to write info into this metadata cache-line anyway, thus it's
almost free.  Moving btf_full_id into xdp_md would require expanding
both xdp_buff and xdp_frame (+ an extra store for converting
buff-to-frame).  If AF_XDP needs this btf_full_id, the BPF-prog _could_
move/copy it from xdp_md to metadata, but that would just waste cycles;
why not store it once in a known location.

One option, for convenience, would be to map xdp_md->btf_full_id to load
the btf_full_id value from the metadata.  But that would essentially be
syntax sugar and would add UAPI.
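
For the AF_XDP side, reading that known location is then a one-liner
(userspace sketch, assuming the trailing 64-bit ID convention described
above; pkt_data would be what e.g. xsk_umem__get_data() returns):

  #include <stdint.h>
  #include <string.h>

  static inline uint64_t xdp_hints_btf_full_id(const void *pkt_data)
  {
          uint64_t id;

          /* Last metadata member sits right before packet data start;
           * memcpy sidesteps any alignment concerns.
           */
          memcpy(&id, (const uint8_t *)pkt_data - sizeof(id), sizeof(id));
          return id;
  }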

> what's the action on the bpf side to consume this?
> 
> If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
> populated at runtime by libbpf? */) {

See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp).
AFAIK libbpf will make this a constant at load/setup time, and give us
dead-code elimination.

>    // do something with rx_timestamp
>    // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
> } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
>    // do something else
>    // plus explicitly handle xdp_hints_common here?
> } else {
>    // handle xdp_hints_common
> }

I added a BPF-helper that can tell us if the layout is compatible with
xdp_hints_common, which is basically the only UAPI the patchset 
introduces.  The code handling xdp_hints_common should be common.

I'm not super happy with the BPF-helper approach, so suggestions are
welcome.  E.g. xdp_md/ctx->is_hint_common could be one approach, and
ctx->has_hint another (ctx is often called xdp, so it reads xdp->has_hint).

One feature I need from the BPF-helper is to "disable" the xdp_hints and
allow the BPF-prog to use the entire metadata area for something else 
(avoiding it being misinterpreted by the next prog or after redirect).


> 
> What I'd like to avoid is an xdp program targeting specific drivers.
> Where possible, we should aim towards something like "if this device
> has rx_timestamp offload -> use it without depending too much on
> specific btf_ids.
> 

I do understand your wish, and adding rx_timestamps to xdp_hints_common
would be too easy (and IMHO waste a u64/8 bytes for all packets not
needing this timestamp).  Hopefully we can come up with a good solution
together.

One idea would be to extend libbpf to look up or translate the struct name

  struct xdp_hints_DRIVER_timestamp {
    __u64 rx_timestamp;
  } __attribute__((preserve_access_index));

into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when 
loading the XDP prog, and have the bpf_core_type_id_kernel() result for 
the struct return the ID of xdp_hints_i40e_timestamp.

But this idea doesn't really work for the veth redirect use-case :-(
as veth needs to handle xdp_hints from other drivers.

>>> c. The subset of fields in <device>_xdp_rx_hints should match the ones from
>>>      xdp_rx_hints (we essentially standardize on the field names/sizes)
>>> d. We expose <device>_xdp_rx_hints btf id via netlink for each device
>>
>> For this proposed design you would still need more than one BTF ID or
>> <device>_xdp_rx_hints struct's, because not all packets contains all
>> hints. The most common case is HW timestamping, which some HW only
>> supports for PTP frames.
>>
>> Plus, I don't see a need to expose anything via netlink, as we can just
>> use the existing BTF information from the module.  Thus, avoiding to
>> creating more UAPI.
> 
> See above. I think even with your series, that btf_id info should also
> come via netlink so the programs can query it before loading and do
> the required adjustments. Otherwise, I'm not sure I understand what I
> need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
> late? I need to know them in advance to at least populate those ids
> into the bpf program itself?

Yes, we need to know these IDs in advance, and we can.  I don't think we 
need the netlink interface, as we can already read out the BTF layout 
and IDs today.  I coded it up in userspace, where the intended consumer 
is AF_XDP (as libbpf already does this itself).

See this code:
 - https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
 - https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c
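
The gist of those examples, sketched against libbpf's public BTF API
(error handling trimmed; the module-BTF *object* ID, the upper half of
the full 64-bit ID, is obtained separately via the kernel's BTF object
iteration, as btf_module_ids.c shows):

  #include <bpf/btf.h>
  #include <stdio.h>

  int lookup_hints_type_id(const char *module, const char *struct_name)
  {
          struct btf *vmlinux_btf = btf__load_vmlinux_btf();
          struct btf *mod_btf = btf__load_module_btf(module, vmlinux_btf);
          int type_id;

          /* Resolve e.g. "xdp_hints_i40e_timestamp" to its type ID
           * within the module's BTF.
           */
          type_id = btf__find_by_name_kind(mod_btf, struct_name,
                                           BTF_KIND_STRUCT);
          printf("%s: type id %d\n", struct_name, type_id);

          btf__free(mod_btf);
          btf__free(vmlinux_btf);
          return type_id;
  }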


>>> e. libbpf will query and do offset relocations for
>>>      xdp_rx_hints -> <device>_xdp_rx_hints at load time
>>>
>>> Would that work? Then it seems like we can replace bitfields with the
>>
>> I used to be a fan of bitfields, until I discovered that they are bad
>> for performance, because compilers cannot optimize these.
> 
> Ack, good point, something to keep in mind.
> 
>>> following:
>>>
>>>     if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
>>>       /* use that hint */
>>
>> Fairly often a VLAN will not be set in packets, so we still have to read
>> and check a bitfield/flag if the VLAN value is valid. (Guess it is
>> implicit in above code).
> 
> That's a fair point. Then we need two signals?
> 
> 1. Whether this particular offload is supported for the device at all
> (via that bpf_core_field_exists or something similar)
> 2. Whether this particular packet has particular metadata (via your
> proposed flags)
> 
> if (device I'm attaching xdp to has vlan offload) { // via
> bpf_core_field_exists?
>    if (particular packet comes with a vlan tag) { // via your proposed
> bitfield flags?
>    }
> }
> 
> Or are we assuming that (2) is fast enough and we don't care about
> (1)? Because (1) can 'if (0)' the whole branch and make the verifier
> remove that part.
> 
>>>     }
>>>
>>> All we need here is for libbpf to, again, do xdp_rx_hints ->
>>> <device>_xdp_rx_hints translation before it evaluates
>>> bpf_core_field_exists()?
>>>
>>> Thoughts? Any downsides? Am I missing something?
>>>
>>
>> Well, the downside is primarily that this design limits innovation.
>>
>> Each time a NIC driver want to introduce a new hardware hint, they have
>> to update the central UAPI xdp_rx_hints struct first.
>>
>> The design in the patchset is to open for innovation.  Driver can extend
>> their own xdp_hints_<driver>_xxx struct(s).  They still have to land
>> their patches upstream, but avoid mangling a central UAPI struct. As
>> upstream we review driver changes and should focus on sane struct member
>> naming(+size) especially if this "sounds" like a hint/feature that more
>> driver are likely to support.  With help from BTF relocations, a new
>> driver can support same hint/feature if naming(+size) match (without
>> necessary the same offset in the struct).
> 
> The opposite side of this approach is that we'll have 'ixgbe_hints'
> with 'rx_timestamp' and 'mvneta_hints' with something like
> 'rx_tstamp'.

Well, as I wrote, reviewers should ask drivers to use the same member name.

>>> Also, about the TX side: I feel like the same can be applied there,
>>> the program works with xdp_tx_hints and libbpf will rewrite to
>>> <device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
>>> those, presumably, can be relocatable by libbpf as well?
>>>
>>
>> Good to think ahead for TX-side, even-though I think we should focus on
>> landing RX-side first.
>>
>> I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
>> common struct xdp_hints_common, without a RX/TX direction indication.
>> Maybe this is wrong of me, but my thinking was that most of the common
>> hints can be directly used as TX-side hints.  I'm hoping TX-side
>> xdp-hints will need to do little-to-non adjustment, before using the
>> hints as TX "instruction".  I'm hoping that XDP-redirect will just work
>> and xmit driver can use XDP-hints area.
>>
>> Please correct me if I'm wrong.
>> The checksum fields hopefully translates to similar TX offload "actions".
>> The VLAN offload hint should translate directly to TX-side.
>>
>> I can easily be convinced we should name it xdp_hints_rx_common from the
>> start, but then I will propose that xdp_hints_tx_common have the
>> checksum and VLAN fields+flags at same locations, such that we don't
>> take any performance hint for moving them to "TX-side" hints, making
>> XDP-redirect just work.
> 
> Might be good to think about this beforehand. I agree that most of the
> layout should hopefully match. However once case that I'm interested
> in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
> in the metadata; for tx, I'm merely setting a flag somewhere to
> request it for async delivery later (I hope we plan to support that
> for af_xdp?). So the layout might be completely different :-(
> 

Yes, it is definitely in my plans to support handling at TX-completion
time, so you can extract the TX-wire-timestamp.  This is easy for AF_XDP
as it has the CQ (Completion Queue) step.

I'm getting ahead of myself, but for XDP I imagine that the driver will
populate this xdp_tx_hint in the DMA TX-completion function, and we can
add a kfunc "not-a-real-hook" to xdp_return_frame that can run another
XDP BPF-prog that can inspect the xdp_tx_hint in metadata.

At this proposed kfunc xdp_return_frame call point, we likely cannot 
know which driver produced the xdp_hints metadata either, and thus must 
not lock our design or BTF-relocations into assuming which driver it is 
loaded on.

[... cut ... getting too long]

--Jesper
Stanislav Fomichev Oct. 5, 2022, 6:43 p.m. UTC | #16
On 10/05, Jesper Dangaard Brouer wrote:


> On 04/10/2022 20.26, Stanislav Fomichev wrote:
> > On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> > >
> > >
> > > On 04/10/2022 01.55, sdf@google.com wrote:
> > > > On 09/07, Jesper Dangaard Brouer wrote:
> > > > > This patchset expose the traditional hardware offload hints to  
> XDP and
> > > > > rely on BTF to expose the layout to users.
> > > >
> > > > > Main idea is that the kernel and NIC drivers simply defines the  
> struct
> > > > > layouts they choose to use for XDP-hints. These XDP-hints structs  
> gets
> > > > > naturally and automatically described via BTF and implicitly  
> exported to
> > > > > users. NIC drivers populate and records their own BTF ID as the  
> last
> > > > > member in XDP metadata area (making it easily accessible by AF_XDP
> > > > > userspace at a known negative offset from packet data start).
> > > >
> > > > > Naming conventions for the structs (xdp_hints_*) is used such that
> > > > > userspace can find and decode the BTF layout and match against the
> > > > > provided BTF IDs. Thus, no new UAPI interfaces are needed for  
> exporting
> > > > > what XDP-hints a driver supports.
> > > >
> > > > > The patch "i40e: Add xdp_hints_union" introduce the idea of  
> creating a
> > > > > union named "xdp_hints_union" in every driver, which contains all
> > > > > xdp_hints_* struct this driver can support. This makes it  
> easier/quicker
> > > > > to find and parse the relevant BTF types.  (Seeking input before  
> fixing
> > > > > up all drivers in patchset).
> > > >
> > > >
> > > > > The main different from RFC-v1:
> > > > >    - Drop idea of BTF "origin" (vmlinux, module or local)
> > > > >    - Instead to use full 64-bit BTF ID that combine object+type ID
> > > >
> > > > > I've taken some of Alexandr/Larysa's libbpf patches and integrated
> > > > > those.
> > > >
> > > > > Patchset exceeds netdev usually max 15 patches rule. My excuse is  
> three
> > > > > NIC drivers (i40e, ixgbe and mvneta) gets XDP-hints support and  
> which
> > > > > required some refactoring to remove the SKB dependencies.
> > > >
> > > > Hey Jesper,
> > > >
> > > > I took a quick look at the series.
> > > Appreciate that! :-)
> > >
> > > > Do we really need the enum with the flags?
> > >
> > > The primary reason for using enum is that these gets exposed as BTF.
> > > The proposal is that userspace/BTF need to obtain the flags via BTF,
> > > such that they don't become UAPI, but something we can change later.
> > >
> > > > We might eventually hit that "first 16 bits are reserved" issue?
> > > >
> > > > Instead of exposing enum with the flags, why not solve it as  
> follows:
> > > > a. We define UAPI struct xdp_rx_hints with _all_ possible hints
> > >
> > > How can we know _all_ possible hints from the beginning(?).
> > >
> > > UAPI + central struct dictating all possible hints, will limit  
> innovation.
> >
> > We don't need to know them all in advance. The same way we don't know
> > them all for flags enum. That UAPI xdp_rx_hints can be extended any
> > time some driver needs some new hint offload. The benefit here is that
> > we have a "common registry" of all offloads and different drivers have
> > an opportunity to share.
> >
> > Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake
> > uapi struct (__sk_buff) and the access to it gets translated into
> > <device>_xdp_rx_hints offsets (sk_buff).
> >
> > > > b. Each device defines much denser <device>_xdp_rx_hints struct  
> with the
> > > >      metadata that it supports
> > >
> > > Thus, the NIC device is limited to what is defined in UAPI struct
> > > xdp_rx_hints.  Again this limits innovation.
> >
> > I guess what I'm missing from your series is the bpf/userspace side.
> > Do you have an example on the bpf side that will work for, say,
> > xdp_hints_ixgbe_timestamp?
> >
> > Suppose, you pass this custom hints btf_id via xdp_md as proposed,

> I just want to reiterate why we place btf_full_id at the "end inline".
> This makes it easily available for AF_XDP to consume.  Plus, we already
> have to write info into this metadata cache-line anyway, thus it's
> almost free.  Moving bpf_full_id into xdp_md, will require expanding
> both xdp_buff and xdp_frame (+ extra store for converting
> buff-to-frame). If AF_XDP need this btf_full_id the BPF-prog _could_
> move/copy it from xdp_md to metadata, but that will just waste cycles,
> why not just store it once in a known location.

> One option, for convenience, would be to map xdp_md->bpf_full_id to load
> the btf_full_id value from the metadata.  But that would essentially be
> syntax-sugar and adds UAPI.

> > what's the action on the bpf side to consume this?
> >
> > If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
> > populated at runtime by libbpf? */) {

> See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)
> AFAIK libbpf will make this a constant at load/setup time, and give us
> dead-code elimination.

Even with bpf_core_type_id_kernel() you still would have the following:

	if (ctx_hints_btf_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
	} else if (the same for every driver that has custom hints) {
	}

Toke has a good suggestion on hiding this behind a helper, either
pre-generated on the libbpf side or a kfunc. We should try to hide
this per-device logic if possible; otherwise we'll end up with per-device
XDP programs that only work on some special deployments. OTOH, we'll
probably get there with the hints anyway?

> >    // do something with rx_timestamp
> >    // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
> > } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
> >    // do something else
> >    // plus explicitly handle xdp_hints_common here?
> > } else {
> >    // handle xdp_hints_common
> > }

> I added a BPF-helper that can tell us if layout if compatible with
> xdp_hints_common, which is basically the only UAPI the patchset  
> introduces.
> The handle xdp_hints_common code should be common.

> I'm not super happy with the BPF-helper approach, so suggestions are
> welcome.  E.g. xdp_md/ctx->is_hint_common could be one approach and
> ctx->has_hint (ctx is often called xdp so it reads xdp->has_hint).

> One feature I need from the BPF-helper is to "disable" the xdp_hints and
> allow the BPF-prog to use the entire metadata area for something else
> (avoiding it to be misintrepreted by next prog or after redirect).

As mentioned in the previous emails, let's try to have a bpf-side
example/selftest for the next round? I also feel like xdp_hints_common is
a bit distracting. It makes the common case easy and it hides the
discussion/complexity about per-device hints. Maybe we can drop this
common case altogether? Why can't every driver have a custom hints struct?
If we agree that naming/size will be the same across them (and review
catches/guarantees that), why do we even care about having a common
xdp_hints_common struct?

> > What I'd like to avoid is an xdp program targeting specific drivers.
> > Where possible, we should aim towards something like "if this device
> > has rx_timestamp offload -> use it without depending too much on
> > specific btf_ids.
> >

> I do understand your wish, and adding rx_timestamps to xdp_hints_common
> would be too easy (and IMHO wasting u64/8-bytes for all packets not
> needing this timestamp).  Hopefully we can come up with a good solution
> together.

> One idea would be to extend libbpf to lookup or translate struct name

>   struct xdp_hints_DRIVER_timestamp {
>     __u64 rx_timestamp;
>   } __attribute__((preserve_access_index));

> into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when  
> loading
> the XDP prog.  And the bpf_core_type_id_kernel() result of the struct
> returning id from xdp_hints_i40e_timestamp.

> But this ideas doesn't really work for the veth redirect use-case :-(
> As veth need to handle xdp_hints from other drivers.

Agreed. If we want redirect to work, then the parsing should either be
mostly pre-generated by libbpf to include all possible btf ids that
matter, or done similarly by a kfunc. The idea that we can pre-generate a
per-device bpf program seems to be out the window now?

> > > > c. The subset of fields in <device>_xdp_rx_hints should match the  
> ones from
> > > >      xdp_rx_hints (we essentially standardize on the field  
> names/sizes)
> > > > d. We expose <device>_xdp_rx_hints btf id via netlink for each  
> device
> > >
> > > For this proposed design you would still need more than one BTF ID or
> > > <device>_xdp_rx_hints struct's, because not all packets contains all
> > > hints. The most common case is HW timestamping, which some HW only
> > > supports for PTP frames.
> > >
> > > Plus, I don't see a need to expose anything via netlink, as we can  
> just
> > > use the existing BTF information from the module.  Thus, avoiding to
> > > creating more UAPI.
> >
> > See above. I think even with your series, that btf_id info should also
> > come via netlink so the programs can query it before loading and do
> > the required adjustments. Otherwise, I'm not sure I understand what I
> > need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
> > late? I need to know them in advance to at least populate those ids
> > into the bpf program itself?

> Yes, we need to know these IDs in advance and can.  I don't think we need
> the netlink interface, as we can already read out the BTF layout and IDs
> today.  I coded it up in userspace, where the intented consumer is AF_XDP
> (as libbpf already does this itself).

> See this code:
>   -  
> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
>   -  
> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c

SG, if we can have some convention on the names so we can reliably
parse out all possible structs with the hints, let's rely solely on
vmlinux + module BTF.

> > > > e. libbpf will query and do offset relocations for
> > > >      xdp_rx_hints -> <device>_xdp_rx_hints at load time
> > > >
> > > > Would that work? Then it seems like we can replace bitfields with  
> the
> > >
> > > I used to be a fan of bitfields, until I discovered that they are bad
> > > for performance, because compilers cannot optimize these.
> >
> > Ack, good point, something to keep in mind.
> >
> > > > following:
> > > >
> > > >     if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
> > > >       /* use that hint */
> > >
> > > Fairly often a VLAN will not be set in packets, so we still have to  
> read
> > > and check a bitfield/flag if the VLAN value is valid. (Guess it is
> > > implicit in above code).
> >
> > That's a fair point. Then we need two signals?
> >
> > 1. Whether this particular offload is supported for the device at all
> > (via that bpf_core_field_exists or something similar)
> > 2. Whether this particular packet has particular metadata (via your
> > proposed flags)
> >
> > if (device I'm attaching xdp to has vlan offload) { // via
> > bpf_core_field_exists?
> >    if (particular packet comes with a vlan tag) { // via your proposed
> > bitfield flags?
> >    }
> > }
> >
> > Or are we assuming that (2) is fast enough and we don't care about
> > (1)? Because (1) can 'if (0)' the whole branch and make the verifier
> > remove that part.
> >
> > > >     }
> > > >
> > > > All we need here is for libbpf to, again, do xdp_rx_hints ->
> > > > <device>_xdp_rx_hints translation before it evaluates
> > > > bpf_core_field_exists()?
> > > >
> > > > Thoughts? Any downsides? Am I missing something?
> > > >
> > >
> > > Well, the downside is primarily that this design limits innovation.
> > >
> > > Each time a NIC driver want to introduce a new hardware hint, they  
> have
> > > to update the central UAPI xdp_rx_hints struct first.
> > >
> > > The design in the patchset is to open for innovation.  Driver can  
> extend
> > > their own xdp_hints_<driver>_xxx struct(s).  They still have to land
> > > their patches upstream, but avoid mangling a central UAPI struct. As
> > > upstream we review driver changes and should focus on sane struct  
> member
> > > naming(+size) especially if this "sounds" like a hint/feature that  
> more
> > > driver are likely to support.  With help from BTF relocations, a new
> > > driver can support same hint/feature if naming(+size) match (without
> > > necessary the same offset in the struct).
> >
> > The opposite side of this approach is that we'll have 'ixgbe_hints'
> > with 'rx_timestamp' and 'mvneta_hints' with something like
> > 'rx_tstamp'.

> Well, as I wrote reviewers should ask drivers to use the same member name.

SG!

> > > > Also, about the TX side: I feel like the same can be applied there,
> > > > the program works with xdp_tx_hints and libbpf will rewrite to
> > > > <device>_xdp_tx_hints. xdp_tx_hints might have fields  
> like "has_tx_vlan:1";
> > > > those, presumably, can be relocatable by libbpf as well?
> > > >
> > >
> > > Good to think ahead for TX-side, even-though I think we should focus  
> on
> > > landing RX-side first.
> > >
> > > I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
> > > common struct xdp_hints_common, without a RX/TX direction indication.
> > > Maybe this is wrong of me, but my thinking was that most of the common
> > > hints can be directly used as TX-side hints.  I'm hoping TX-side
> > > xdp-hints will need to do little-to-non adjustment, before using the
> > > hints as TX "instruction".  I'm hoping that XDP-redirect will just  
> work
> > > and xmit driver can use XDP-hints area.
> > >
> > > Please correct me if I'm wrong.
> > > The checksum fields hopefully translates to similar TX  
> offload "actions".
> > > The VLAN offload hint should translate directly to TX-side.
> > >
> > > I can easily be convinced we should name it xdp_hints_rx_common from  
> the
> > > start, but then I will propose that xdp_hints_tx_common have the
> > > checksum and VLAN fields+flags at same locations, such that we don't
> > > take any performance hint for moving them to "TX-side" hints, making
> > > XDP-redirect just work.
> >
> > Might be good to think about this beforehand. I agree that most of the
> > layout should hopefully match. However once case that I'm interested
> > in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
> > in the metadata; for tx, I'm merely setting a flag somewhere to
> > request it for async delivery later (I hope we plan to support that
> > for af_xdp?). So the layout might be completely different :-(
> >

> Yes, it is definitely in my plans to support handling at TX-completion
> time, so you can extract the TX-wire-timestamp.  This is easy for AF_XDP
> as it has the CQ (Completion Queue) step.

> I'm getting ahead of myself, but for XDP I imagine that driver will
> populate this xdp_tx_hint in DMA TX-completion function, and we can add
> a kfunc "not-a-real-hook" to xdp_return_frame that can run another XDP
> BPF-prog that can inspect the xdp_tx_hint in metadata.

Can we also place that xdp_tx_hint somewhere in the completion ring
for AF_XDP to consume?

> At this proposed kfunc xdp_return_frame call point, we likely cannot know
> what driver that produced the xdp_hints metadata either, and thus not lock
> our design or BTF-reloacations to assume which driver is it loaded on.

> [... cut ... getting too long]

> --Jesper
Stanislav Fomichev Oct. 5, 2022, 6:47 p.m. UTC | #17
On 10/05, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:

> > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> >> > A intentionally wild question, what does it take for the driver to  
> return the
> >> > hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is  
> calling a
> >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the  
> driver
> >> > replace it with some inline bpf code (like how the inline code is  
> generated for
> >> > the map_lookup helper).  The xdp prog can then store the hwstamp in  
> the meta
> >> > area in any layout it wants.
> >>
> >> Since you mentioned it... FWIW that was always my preference rather  
> than
> >> the BTF magic :)  The jited image would have to be per-driver like we
> >> do for BPF offload but that's easy to do from the technical
> >> perspective (I doubt many deployments bind the same prog to multiple
> >> HW devices)..
> >
> > +1, sounds like a good alternative (got your reply while typing)
> > I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> > parse it out from the pre-populated metadata?
> >
> > Btw, do we also need to think about the redirect case? What happens
> > when I redirect one frame from a device A with one metadata format to
> > a device B with another?

> Yes, we absolutely do! In fact, to me this (redirects) is the main
> reason why we need the ID in the packet in the first place: when running
> on (say) a veth, an XDP program needs to be able to deal with packets
> from multiple physical NICs.

> As far as API is concerned, my hope was that we could solve this with a
> CO-RE like approach where the program author just writes something like:

> hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);

> and bpf_get_xdp_hint() is really a macro (or a special kind of
> relocation?) and libbpf would do the following on load:

> - query the kernel BTF for all possible xdp_hint structs
> - figure out which of them have an 'u64 hw_tstamp' member
> - generate the necessary conditionals / jump table to disambiguate on
>    the BTF_ID in the packet


> Now, if this is better done by a kfunc I'm not terribly opposed to that
> either, but I'm not sure it's actually better/easier to do in the kernel
> than in libbpf at load time?

Replied in the other thread, but to reiterate here: then the btf_id in the
metadata has to stay, and we either pre-generate those bpf_get_xdp_hint()
at the libbpf level or at kfunc load time, as you mention.

But the program essentially has to handle all possible hints' btf ids thrown
at it by the system. Not sure about the performance in this case :-/
Maybe that's something that can be hidden behind an "I might receive
forwarded packets and I know how to handle all metadata formats" flag?
By default, we'd pre-generate parsing only for that specific device?
Martin KaFai Lau Oct. 5, 2022, 7:26 p.m. UTC | #18
On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>> +1, sounds like a good alternative (got your reply while typing)
>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>> parse it out from the pre-populated metadata?
>>
>> I'd think so, worst case the driver can put xdp_md into a struct
>> and container_of() to get to its own stack with whatever fields
>> it needs.
> 
> Ack, seems like something worth exploring then.
> 
> The only issue I see with that is that we'd probably have to extend
> the loading api to pass target xdp device so we can pre-generate
> per-device bytecode for those kfuncs? 

There is an existing attr->prog_ifindex for the dev offload purpose.  Maybe we can 
re-purpose/re-use some of the offload API.  How this kfunc can be presented also 
needs some thought, could be a new ndo_xxx.... not sure.
> And this potentially will block attaching the same program
> to different drivers/devices?
> Or, Martin, did you maybe have something better in mind?

If the kfunc/helper is inlined, then it will have to be per device.  Unless the 
bpf prog chooses not to inline, which could be an option, but I am also not sure 
how often the user wants to 'attach' a loaded xdp prog to a different device. 
To some extent, the CO-RE hints-loading-code will have to be per device also, no?

Why I asked about the kfunc/helper approach is because, from the set, it seems 
the hints are already available at the driver.  The specific knowledge the 
xdp prog is missing is how to get the hints from the rx_desc/rx_queue.  The 
straightforward way to me is to make them (rx_desc/rx_queue) available to the 
xdp prog and have a kfunc/helper extract the hints from them only if the xdp 
prog needs it.  The xdp prog can selectively get what hints it needs and then 
optionally store them into the meta area in any layout.

NETIF_F_XDP_HINTS_BIT probably won't be needed, and that is one less thing to 
worry about in production.
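
A sketch of how the kfunc idea could look from the program's
perspective (entirely hypothetical; neither this kfunc nor its
per-device inlining exists today):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Hypothetical kfunc the driver would inline per-device at load time,
   * reading the hint straight from the rx_desc/rx_queue.
   */
  extern __u64 bpf_xdp_get_hwtstamp(struct xdp_md *ctx) __ksym;

  SEC("xdp")
  int xdp_kfunc_hints(struct xdp_md *ctx)
  {
          /* The program pays only for the hint it actually requests */
          __u64 hwtstamp = bpf_xdp_get_hwtstamp(ctx);

          if (hwtstamp)
                  bpf_printk("hwtstamp=%llu", hwtstamp);
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";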

> 
>>> Btw, do we also need to think about the redirect case? What happens
>>> when I redirect one frame from a device A with one metadata format to
>>> a device B with another?
>>
>> If there is a program on Tx then it'd be trivial - just do the
>> info <-> descriptor translation in the opposite direction than Rx.

+1

>> TBH I'm not sure how it'd be done in the current approach, either.

Yeah, I think we need more selftests to show how things work.

> 
> Yeah, I don't think it magically works in any case. I'm just trying to
> understand whether it's something we care to support out of the box or
> can punt to the bpf programs themselves and say "if you care about
> forwarding metadata, somehow agree on the format yourself".
> 
>> Now I questioned the BTF way and mentioned the Tx-side program in
>> a single thread, I better stop talking...
> 
> Forget about btf, hail to the new king - kfunc :-D
Maryam Tahhan Oct. 6, 2022, 8:19 a.m. UTC | #19
On 05/10/2022 19:47, sdf@google.com wrote:
> On 10/05, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
> 
>> > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> >>
>> >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>> >> > A intentionally wild question, what does it take for the driver 
>> to return the
>> >> > hints.  Is the rx_desc and rx_queue enough?  When the xdp prog is 
>> calling a
>> >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can 
>> the driver
>> >> > replace it with some inline bpf code (like how the inline code is 
>> generated for
>> >> > the map_lookup helper).  The xdp prog can then store the hwstamp 
>> in the meta
>> >> > area in any layout it wants.
>> >>
>> >> Since you mentioned it... FWIW that was always my preference rather 
>> than
>> >> the BTF magic :)  The jited image would have to be per-driver like we
>> >> do for BPF offload but that's easy to do from the technical
>> >> perspective (I doubt many deployments bind the same prog to multiple
>> >> HW devices)..
>> >
>> > +1, sounds like a good alternative (got your reply while typing)
>> > I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>> > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>> > parse it out from the pre-populated metadata?
>> >
>> > Btw, do we also need to think about the redirect case? What happens
>> > when I redirect one frame from a device A with one metadata format to
>> > a device B with another?
> 
>> Yes, we absolutely do! In fact, to me this (redirects) is the main
>> reason why we need the ID in the packet in the first place: when running
>> on (say) a veth, an XDP program needs to be able to deal with packets
>> from multiple physical NICs.
> 
>> As far as API is concerned, my hope was that we could solve this with a
>> CO-RE like approach where the program author just writes something like:
> 
>> hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);
> 
>> and bpf_get_xdp_hint() is really a macro (or a special kind of
>> relocation?) and libbpf would do the following on load:
> 
>> - query the kernel BTF for all possible xdp_hint structs
>> - figure out which of them have an 'u64 hw_tstamp' member
>> - generate the necessary conditionals / jump table to disambiguate on
>>    the BTF_ID in the packet
> 
> 
>> Now, if this is better done by a kfunc I'm not terribly opposed to that
>> either, but I'm not sure it's actually better/easier to do in the kernel
>> than in libbpf at load time?
> 
> Replied in the other thread, but to reiterate here: then btf_id in the
> metadata has to stay and we either pre-generate those bpf_get_xdp_hint()
> at libbpf or at kfunc load time level as you mention.
> 
> But the program essentially has to handle all possible hints' btf ids 
> thrown
> at it by the system. Not sure about the performance in this case :-/
> Maybe that's something that can be hidden behind "I might receive forwarded
> packets and I know how to handle all metadata format" flag? By default,
> we'll pre-generate parsing only for that specific device?

I did a simple POC of Jesper's xdp-hints with AF_XDP and CNDP (Cloud 
Native Data Plane). In the cases where my app had access to the HW, I 
didn't need to handle all possible hints... I knew which drivers were on 
the system and those were the hints I needed to deal with.

So at program init time I registered the relevant BTF_IDs (and some 
callback functions to handle them) from the NICs that were available to 
me in a simple tailq (tbh there were so few I could've probably used a 
static array).

When processing the hints I then only needed to invoke the appropriate 
callback function based on the received BTF_ID. I didn't have massive 
chains of if...else if... else statements.

In the case where we have redirection to a virtual NIC and we don't 
necessarily know the underlying hints that are exposed to the app, could 
we not still use the xdp_hints (as proposed by Jesper) themselves to 
indicate the relevant drivers to the application? Or even indicate them 
via a map or something?
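
Roughly, the registration/dispatch described above looks like this
(illustrative userspace sketch, not actual CNDP code):

  #include <stdint.h>

  struct hint_handler {
          uint64_t btf_full_id;               /* driver layout ID */
          void (*parse)(const void *meta);    /* layout-specific parser */
  };

  #define MAX_HANDLERS 8
  static struct hint_handler handlers[MAX_HANDLERS];
  static int nr_handlers;

  static void register_handler(uint64_t id, void (*parse)(const void *))
  {
          if (nr_handlers < MAX_HANDLERS)
                  handlers[nr_handlers++] =
                          (struct hint_handler){ id, parse };
  }

  /* RX path: dispatch on the BTF ID found in the packet metadata */
  static void dispatch_hints(uint64_t btf_full_id, const void *meta)
  {
          for (int i = 0; i < nr_handlers; i++) {
                  if (handlers[i].btf_full_id == btf_full_id) {
                          handlers[i].parse(meta);
                          return;
                  }
          }
          /* unknown layout: ignore the metadata */
  }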
Magnus Karlsson Oct. 6, 2022, 9:14 a.m. UTC | #20
On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
> > On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
> >>> +1, sounds like a good alternative (got your reply while typing)
> >>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> >>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> >>> parse it out from the pre-populated metadata?
> >>
> >> I'd think so, worst case the driver can put xdp_md into a struct
> >> and container_of() to get to its own stack with whatever fields
> >> it needs.
> >
> > Ack, seems like something worth exploring then.
> >
> > The only issue I see with that is that we'd probably have to extend
> > the loading api to pass target xdp device so we can pre-generate
> > per-device bytecode for those kfuncs?
>
> There is an existing attr->prog_ifindex for dev offload purpose.  May be we can
> re-purpose/re-use some of the offload API.  How this kfunc can be presented also
> needs some thoughts, could be a new ndo_xxx.... not sure.
> > And this potentially will block attaching the same program
>  > to different drivers/devices?
> > Or, Martin, did you maybe have something better in mind?
>
> If the kfunc/helper is inline, then it will have to be per device.  Unless the
> bpf prog chooses not to inline which could be an option but I am also not sure
> how often the user wants to 'attach' a loaded xdp prog to a different device.
> To some extend, the CO-RE hints-loading-code will have to be per device also, no?
>
> Why I asked the kfunc/helper approach is because, from the set, it seems the
> hints has already been available at the driver.  The specific knowledge that the
> xdp prog missing is how to get the hints from the rx_desc/rx_queue.  The
> straight forward way to me is to make them (rx_desc/rx_queue) available to xdp
> prog and have kfunc/helper to extract the hints from them only if the xdp prog
> needs it.  The xdp prog can selectively get what hints it needs and then
> optionally store them into the meta area in any layout.

This sounds like a really good idea to me, well worth exploring. Only
having to pay, performance-wise, for the metadata you actually use is
very important. I did some experiments [1] on the previous patch set of
Jesper's, and there is substantial overhead added for each metadata
field enabled (and fetched from the NIC). This is especially
important for AF_XDP in zero-copy mode where most packets are directed
to user-space (if not, you should be using the regular driver that is
optimized for passing packets to the stack or redirecting to other
devices). In this case, the user knows exactly what metadata it wants
and where in the metadata area it should be located in order to offer
the best performance for the application in question. But as you say,
your suggestion could potentially offer a good performance upside to
the regular XDP path too.

[1] https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/


> NETIF_F_XDP_HINTS_BIT probably won't be needed and one less thing to worry in
> production.
>
> >
> >>> Btw, do we also need to think about the redirect case? What happens
> >>> when I redirect one frame from a device A with one metadata format to
> >>> a device B with another?
> >>
> >> If there is a program on Tx then it'd be trivial - just do the
> >> info <-> descriptor translation in the opposite direction than Rx.
>
> +1
>
> >> TBH I'm not sure how it'd be done in the current approach, either.
>
> Yeah, I think we need more selftest to show how things work.
>
> >
> > Yeah, I don't think it magically works in any case. I'm just trying to
> > understand whether it's something we care to support out of the box or
> > can punt to the bpf programs themselves and say "if you care about
> > forwarding metadata, somehow agree on the format yourself".
> >
> >> Now I questioned the BTF way and mentioned the Tx-side program in
> >> a single thread, I better stop talking...
> >
> > Forget about btf, hail to the new king - kfunc :-D
>
Jakub Kicinski Oct. 6, 2022, 2:59 p.m. UTC | #21
On Wed, 5 Oct 2022 16:19:30 +0200 Jesper Dangaard Brouer wrote:
> >> Since you mentioned it... FWIW that was always my preference rather than
> >> the BTF magic :)  The jited image would have to be per-driver like we
> >> do for BPF offload but that's easy to do from the technical
> >> perspective (I doubt many deployments bind the same prog to multiple
> >> HW devices)..  
> 
> On the technical side we do have the ifindex that can be passed along
> which is currently used for getting XDP hardware offloading to work.
> But last time I tried this, I failed due to BPF tail call maps.

FWIW the tail call map issue should be solvable by enforcing that the map 
is also pinned and so are all the programs in it. Perhaps I find that
less ugly than others do.. since that's what the offload path did :)

> (It's not going to fly for other reasons, see redirect below).
Jesper Dangaard Brouer Oct. 6, 2022, 3:29 p.m. UTC | #22
On 06/10/2022 11.14, Magnus Karlsson wrote:
> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>>>> parse it out from the pre-populated metadata?
>>>>
>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>> and container_of() to get to its own stack with whatever fields
>>>> it needs.
>>>
>>> Ack, seems like something worth exploring then.
>>>
>>> The only issue I see with that is that we'd probably have to extend
>>> the loading api to pass target xdp device so we can pre-generate
>>> per-device bytecode for those kfuncs?
>>
>> There is an existing attr->prog_ifindex for dev offload purpose.  May be we can
>> re-purpose/re-use some of the offload API.  How this kfunc can be presented also
>> needs some thoughts, could be a new ndo_xxx.... not sure.
>>> And this potentially will block attaching the same program
>>   > to different drivers/devices?
>>> Or, Martin, did you maybe have something better in mind?
>>
>> If the kfunc/helper is inline, then it will have to be per device.  Unless the
>> bpf prog chooses not to inline which could be an option but I am also not sure
>> how often the user wants to 'attach' a loaded xdp prog to a different device.
>> To some extend, the CO-RE hints-loading-code will have to be per device also, no?
>>
>> Why I asked the kfunc/helper approach is because, from the set, it seems the
>> hints has already been available at the driver.  The specific knowledge that the
>> xdp prog missing is how to get the hints from the rx_desc/rx_queue.  The
>> straight forward way to me is to make them (rx_desc/rx_queue) available to xdp
>> prog and have kfunc/helper to extract the hints from them only if the xdp prog
>> needs it.  The xdp prog can selectively get what hints it needs and then
>> optionally store them into the meta area in any layout.
> 
> This sounds like a really good idea to me, well worth exploring. To
> only have to pay, performance wise, for the metadata you actually use
> is very important. I did some experiments [1] on the previous patch
> set of Jesper's and there is substantial overhead added for each
> metadata enabled (and fetched from the NIC). This is especially
> important for AF_XDP in zero-copy mode where most packets are directed
> to user-space (if not, you should be using the regular driver that is
> optimized for passing packets to the stack or redirecting to other
> devices). In this case, the user knows exactly what metadata it wants
> and where in the metadata area it should be located in order to offer
> the best performance for the application in question. But as you say,
> your suggestion could potentially offer a good performance upside to
> the regular XDP path too.

Okay, let's revisit this again, and let me explain why I believe this
isn't going to fly.

This was also my initial thought: let's just give the XDP BPF-prog direct
access to the NIC rx_descriptor, or have another BPF-prog populate the
XDP-hints prior to calling the XDP-prog.  Going down this path
(previously) I learned three things:

(1) Understanding/decoding the rx_descriptor requires access to the
programmer's datasheet, because it is very compacted and the meaning of
the bits depends on other bits, plus the current configuration status of
the HW.

(2) HW has bugs, and for certain chip revisions the driver will skip some
offload hints.  Thus, chip revisions need to be exported to BPF-progs
and handled appropriately.

(3) Sometimes the info is actually not available in the rx_descriptor.
Often for HW timestamps, the timestamp needs to be read from a HW
register.  How do we expose this to the BPF-prog?

> [1] https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/


Notice that this patchset doesn't block this idea, as it is orthogonal.
After we have established a way to express xdp_hints layouts via BTF,
we can still add a pre-XDP BPF-prog that populates the XDP-hints,
and squeeze out more performance by skipping some of the offloads that
your specific XDP-prog is not interested in.

--Jesper
Stanislav Fomichev Oct. 6, 2022, 5:22 p.m. UTC | #23
On 10/06, Maryam Tahhan wrote:
> On 05/10/2022 19:47, sdf@google.com wrote:
> On 10/05, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org>  
> wrote:
> > > >>
> > > >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> > > >> > A intentionally wild question, what does it take for the driver
> > > to return the
> > > >> > hints.  Is the rx_desc and rx_queue enough?  When the xdp prog
> > > is calling a
> > > >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can
> > > the driver
> > > >> > replace it with some inline bpf code (like how the inline code
> > > is generated for
> > > >> > the map_lookup helper).  The xdp prog can then store the
> > > hwstamp in the meta
> > > >> > area in any layout it wants.
> > > >>
> > > >> Since you mentioned it... FWIW that was always my preference
> > > rather than
> > > >> the BTF magic :)  The jited image would have to be per-driver like  
> we
> > > >> do for BPF offload but that's easy to do from the technical
> > > >> perspective (I doubt many deployments bind the same prog to  
> multiple
> > > >> HW devices)..
> > > >
> > > > +1, sounds like a good alternative (got your reply while typing)
> > > > I'm not too versed in the rx_desc/rx_queue area, but seems like  
> worst
> > > > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> > > > parse it out from the pre-populated metadata?
> > > >
> > > > Btw, do we also need to think about the redirect case? What happens
> > > > when I redirect one frame from a device A with one metadata format  
> to
> > > > a device B with another?
> >
> > > Yes, we absolutely do! In fact, to me this (redirects) is the main
> > > reason why we need the ID in the packet in the first place: when  
> running
> > > on (say) a veth, an XDP program needs to be able to deal with packets
> > > from multiple physical NICs.
> >
> > > As far as API is concerned, my hope was that we could solve this with  
> a
> > > CO-RE like approach where the program author just writes something  
> like:
> >
> > > hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);
> >
> > > and bpf_get_xdp_hint() is really a macro (or a special kind of
> > > relocation?) and libbpf would do the following on load:
> >
> > > - query the kernel BTF for all possible xdp_hint structs
> > > - figure out which of them have an 'u64 hw_tstamp' member
> > > - generate the necessary conditionals / jump table to disambiguate on
> > >    the BTF_ID in the packet
> >
> >
> > > Now, if this is better done by a kfunc I'm not terribly opposed to  
> that
> > > either, but I'm not sure it's actually better/easier to do in the  
> kernel
> > > than in libbpf at load time?
> >
> > Replied in the other thread, but to reiterate here: then btf_id in the
> > metadata has to stay and we either pre-generate those bpf_get_xdp_hint()
> > at libbpf or at kfunc load time level as you mention.
> >
> > But the program essentially has to handle all possible hints' btf ids
> > thrown
> > at it by the system. Not sure about the performance in this case :-/
> > Maybe that's something that can be hidden behind "I might receive  
> forwarded
> > packets and I know how to handle all metadata format" flag? By default,
> > we'll pre-generate parsing only for that specific device?

> I did a simple POC of Jespers xdp-hints with AF-XDP and CNDP (Cloud Native
> Data Plane). In the cases where my app had access to the HW I didn't need  
> to
> handle all possible hints... I knew what Drivers were on the system and  
> they
> were the hints I needed to deal with.

> So at program init time I registered the relevant BTF_IDs (and some  
> callback
> functions to handle them) from the NICs that were available to me in a
> simple tailq (tbh there were so few I could've probably used a static
> array).

> When processing the hints then I only needed to invoke the appropriate
> callback function based on the received BTF_ID. I didn't have a massive
> chains of if...else if... else statements.

> In the case where we have redirection to a virtual NIC and we don't
> necessarily know the underlying hints that are exposed to the app, could  
> we
> not still use the xdp_hints (as proposed by Jesper) themselves to indicate
> the relevant drivers to the application? or even indicate them via a map  
> or
> something?


Ideally this should all be handled by the common infra (libbpf/libxdp?).
We probably don't want every xdp/af_xdp user to custom-implement all this
btf_id->layout parsing? Hence the request for a selftest that shows
how metadata can be accessed from bpf/af_xdp.
Jesper Dangaard Brouer Oct. 6, 2022, 5:47 p.m. UTC | #24
On 05/10/2022 20.43, sdf@google.com wrote:
> On 10/05, Jesper Dangaard Brouer wrote:
> 
>> On 04/10/2022 20.26, Stanislav Fomichev wrote:
>> > On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
>> > <jbrouer@redhat.com> wrote:
>> > >
>> > >
>> > > On 04/10/2022 01.55, sdf@google.com wrote:
>> > > > On 09/07, Jesper Dangaard Brouer wrote:
>> > > > > This patchset expose the traditional hardware offload hints to XDP and
>> > > > > rely on BTF to expose the layout to users.
>> > > >
>> > > > > Main idea is that the kernel and NIC drivers simply defines the struct
>> > > > > layouts they choose to use for XDP-hints. These XDP-hints structs gets
>> > > > > naturally and automatically described via BTF and implicitly exported to
>> > > > > users. NIC drivers populate and records their own BTF ID as the last
>> > > > > member in XDP metadata area (making it easily accessible by AF_XDP
>> > > > > userspace at a known negative offset from packet data start).
>> > > >
>> > > > > Naming conventions for the structs (xdp_hints_*) is used such that
>> > > > > userspace can find and decode the BTF layout and match against the
>> > > > > provided BTF IDs. Thus, no new UAPI interfaces are needed for  exporting
>> > > > > what XDP-hints a driver supports.
>> > > >
>> > > > > The patch "i40e: Add xdp_hints_union" introduce the idea of creating a
>> > > > > union named "xdp_hints_union" in every driver, which contains all
>> > > > > xdp_hints_* struct this driver can support. This makes it easier/quicker
>> > > > > to find and parse the relevant BTF types.  (Seeking input before fixing
>> > > > > up all drivers in patchset).
>> > > >
[...]
>> >
>> > > > b. Each device defines much denser <device>_xdp_rx_hints struct with the
>> > > >      metadata that it supports
>> > >
>> > > Thus, the NIC device is limited to what is defined in UAPI struct
>> > > xdp_rx_hints.  Again this limits innovation.
>> >
>> > I guess what I'm missing from your series is the bpf/userspace side.
>> > Do you have an example on the bpf side that will work for, say,
>> > xdp_hints_ixgbe_timestamp?

We have been consuming this from AF_XDP, decoding the BTF in userspace
and checking BTF IDs in our userspace apps.  I will try to code up
consuming this from XDP BPF-progs to get a better feel for that.

>> >
>> > Suppose, you pass this custom hints btf_id via xdp_md as proposed,
> 
>> I just want to reiterate why we place btf_full_id at the "end inline".
>> This makes it easily available for AF_XDP to consume.  Plus, we already
>> have to write info into this metadata cache-line anyway, thus it's
>> almost free.  Moving btf_full_id into xdp_md would require expanding
>> both xdp_buff and xdp_frame (+ an extra store for converting
>> buff-to-frame). If AF_XDP needs this btf_full_id the BPF-prog _could_
>> move/copy it from xdp_md to metadata, but that would just waste cycles;
>> why not just store it once in a known location.
> 
>> One option, for convenience, would be to map xdp_md->btf_full_id to load
>> the btf_full_id value from the metadata.  But that would essentially be
>> syntax-sugar and would add UAPI.
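
For illustration, a minimal AF_XDP-side sketch of reading that ID
(assuming, per the "end inline" placement above, that the 64-bit ID
occupies the last 8 bytes of metadata, immediately before packet-data
start):

#include <stdint.h>
#include <string.h>

/* Hypothetical AF_XDP RX-path helper: 'pkt' points at packet-data
 * start inside the umem frame; the metadata area ends right before it,
 * with the 64-bit BTF ID as the last member. */
static uint64_t read_btf_full_id(const uint8_t *pkt)
{
        uint64_t id;

        memcpy(&id, pkt - sizeof(id), sizeof(id));
        return id;
}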
> 
>> > what's the action on the bpf side to consume this?
>> >
>> > If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
>> > populated at runtime by libbpf? */) {
> 
>> See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)
>> AFAIK libbpf will make this a constant at load/setup time, and give us
>> dead-code elimination.
> 
> Even with bpf_core_type_id_kernel() you still would have the following:
> 
>      if (ctx_hints_btf_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
>      } else if (the same for every driver that has custom hints) {
>      }
> 
> Toke has a good suggestion on hiding this behind a helper; either
> pre-generated on the libbpf side or a kfunc. We should try to hide
> this per-device logic if possible; otherwise we'll end up with per-device
> XDP programs that only work on some special deployments.
> OTOH, we'll probably get there with the hints anyway?

Well yes, hints are trying to let NIC drivers innovate and export HW hints
that are specific to a given driver.  Thus, we should allow code to get
device-specific hints.

I do like this idea of hiding this behind something. E.g. libbpf could
detect this and apply CO-RE tricks: based on a struct name starting with
xdp_rx_hints___xxx and a member rx_timestamp, it could scan the entire
system (all loaded modules) for xdp_rx_hints_* structs, find those that
contain a member rx_timestamp, and then expand that into the
if-else-if statements matching against IDs and accessing rx_timestamp at
the correct offset.
   Unfortunately this auto-expansion will add code that isn't needed for
an XDP BPF-prog loaded on a specific physical device (as some IDs can
never appear there). For the veth case it is useful. Going back to
ifindex: if an XDP BPF-prog does have an ifindex, then we could limit the
expansion to BTF layouts from that driver.  It just feels like a lot of
syntax-sugar and magic to hide the driver name, e.g.
"xdp_hints_ixgbe_timestamp", in the C-code.
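
For concreteness, a minimal BPF-prog sketch of what such an if-else-if
expansion could look like (the xdp_hints_i40e_timestamp layout, and the
metadata area holding exactly this struct, are assumptions for
illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Hypothetical shadow of one driver's hints struct; the member
 * offsets get CO-RE-relocated against the module BTF at load time. */
struct xdp_hints_i40e_timestamp {
        __u64 rx_timestamp;
        __u64 btf_full_id;  /* last member, ends at packet-data start */
} __attribute__((preserve_access_index));

SEC("xdp")
int xdp_read_hints(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *meta = (void *)(long)ctx->data_meta;
        struct xdp_hints_i40e_timestamp *h = meta;

        if (meta + sizeof(*h) > data)  /* bounds check for the verifier */
                return XDP_PASS;

        /* Assumes libbpf patches the full 64-bit obj+type ID into the
         * instruction, per this series' libbpf patches. */
        if (h->btf_full_id ==
            bpf_core_type_id_kernel(struct xdp_hints_i40e_timestamp))
                bpf_printk("rx_timestamp: %llu", h->rx_timestamp);
        /* else if (h->btf_full_id == ...next layout's ID...) { ... } */

        return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";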


>> >    // do something with rx_timestamp
>> >    // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
>> > } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
>> >    // do something else
>> >    // plus explicitly handle xdp_hints_common here?
>> > } else {
>> >    // handle xdp_hints_common
>> > }
> 
>> I added a BPF-helper that can tell us if the layout is compatible with
>> xdp_hints_common, which is basically the only UAPI the patchset
>> introduces.
>> The code that handles xdp_hints_common should be common.
>>
>> I'm not super happy with the BPF-helper approach, so suggestions are
>> welcome.  E.g. xdp_md/ctx->is_hint_common could be one approach, and
>> ctx->has_hint (ctx is often called xdp so it reads xdp->has_hint).
>>
>> One feature I need from the BPF-helper is to "disable" the xdp_hints and
>> allow the BPF-prog to use the entire metadata area for something else
>> (avoiding it being misinterpreted by the next prog or after a redirect).
>> 
> As mentioned in the previous emails, let's try to have a bpf side
> example/selftest for the next round? 

Yes, I do need to add BPF-prog examples and selftests.

I am considering sending the next round (still as RFC) without these, to
show what Maryam and Magnus settled on for the AF_XDP desc option flags.


> I also feel like xdp_hints_common is
> a bit distracting. It makes the common case easy and it hides the
> discussion/complexity about per-device hints. Maybe we can drop this
> common case at all? Why can't every driver have a custom hints struct?
> If we agree that naming/size will be the same across them (and review
> catches/guarantees that), why do we even care about having a common
> xdp_hints_common struct?

The xdp_hints_common struct is a stepping stone to making this easily
consumable from C-code that needs to generate SKBs and info for the
virtio_net 'hdr' desc.

David Ahern has been begging me for years to just add this statically
to xdp_frame.  I have been reluctant, because I think we can come up
with a more flexible (less UAPI-fixed) way that allows both kernel-code
and BPF-progs to access these fields.  I think of this approach as a
compromise between these two users.

  Meaning struct xdp_hints_common can be changed anytime in the kernel
C-code, and BPF-progs must access the area via BTF/CO-RE.
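
To make that compromise concrete, a hypothetical sketch of such a common
struct (member names and sizes here are illustrative assumptions, not the
patchset's actual layout):

/* Illustrative only -- not the patchset's actual layout.  A common
 * kernel-internal struct that C-code can consume directly when building
 * SKBs, and that BPF-progs must access via BTF/CO-RE, so fields can
 * still be reshuffled between kernel releases. */
struct xdp_hints_common {
        __u32 xdp_hints_flags;  /* validity bits: csum, vlan, hash */
        __u32 rx_hash32;
        __u16 vlan_tci;
        __be16 vlan_proto;
        __u64 btf_full_id;      /* always the last member */
};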


>> > What I'd like to avoid is an xdp program targeting specific drivers.
>> > Where possible, we should aim towards something like "if this device
>> > has rx_timestamp offload -> use it without depending too much on
>> > specific btf_ids.
>> >
> 
>> I do understand your wish, and adding rx_timestamps to xdp_hints_common
>> would be too easy (and IMHO wasting u64/8-bytes for all packets not
>> needing this timestamp).  Hopefully we can come up with a good solution
>> together.
> 
>> One idea would be to extend libbpf to look up or translate the struct name
> 
>>   struct xdp_hints_DRIVER_timestamp {
>>     __u64 rx_timestamp;
>>   } __attribute__((preserve_access_index));
> 
>> into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when
>> loading the XDP prog.  bpf_core_type_id_kernel() on that struct would
>> then return the ID of xdp_hints_i40e_timestamp.
> 
>> But this idea doesn't really work for the veth redirect use-case :-(
>> As veth needs to handle xdp_hints from other drivers.
> 
> Agreed. If we want redirect to work, then the parsing should be either
> mostly pre-generated by libbpf to include all possible btf ids that
> matter; or done similarly by a kfunc. The idea that we can pre-generate
> per-device bpf program seems to be out of the window now?
> 

Hmm, the per-device thing could be an optimization that is performed if
an ifindex has been provided.

BUT for redirect to work, we do need the full BTF ID, to
identify structs coming from other device drivers and their BTF layout.
We have mentioned redirect into veth several times, but the same goes
for redirect into AF_XDP, which needs to identify the BTF layout.
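
As a sketch of what "full BTF ID" means here, combining the BTF object ID
(which module) with the type ID (which struct) into one 64-bit value; the
exact bit split below is an assumption:

/* Hypothetical encoding: module BTF object ID in the upper 32 bits,
 * type ID within that object in the lower 32 bits. */
static inline __u64 btf_full_id(__u32 btf_obj_id, __u32 btf_type_id)
{
        return ((__u64)btf_obj_id << 32) | btf_type_id;
}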

[...]
>> > See above. I think even with your series, that btf_id info should also
>> > come via netlink so the programs can query it before loading and do
>> > the required adjustments. Otherwise, I'm not sure I understand what I
>> > need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
>> > late? I need to know them in advance to at least populate those ids
>> > into the bpf program itself?
> 
>> Yes, we need to know these IDs in advance, and we can.  I don't think we
>> need the netlink interface, as we can already read out the BTF layout and
>> IDs today.  I coded it up in userspace, where the intended consumer is
>> AF_XDP (as libbpf already does this itself).
> 
>> See this code:
>>   - 
>> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
>>   - 
>> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c
> 
> SG, if we can have some convention on the names where we can reliably
> parse out all possible structs with the hints, let's rely solely on
> vmlinux + module BTF.
> 

Yes, I am proposing a convention on the struct BTF names, to find the
'xdp_hints_*' structs that the driver can produce.

To make it quicker to find the xdp_hints structs in a driver, I am also
proposing a 'union' that contains all the xdp_hints structs.

  - See "[PATCH 14/18] i40e: Add xdp_hints_union".

The BTF effect of this is that each driver will have an xdp_hints_union
with the same "name", which points to all the other BTF IDs.

I am wondering if we can leverage this for CO-RE relocations too.
Then you can define your BPF-prog shadow union with the member
rx_timestamp (and __attribute__((preserve_access_index))) and let
CO-RE/libbpf do the offset adjustments. (But again we are back to which
driver the BPF-prog is attached to, and veth having to handle all
possible drivers.)
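
A minimal sketch of such a shadow union (the members shown are
assumptions; CO-RE would relocate the rx_timestamp offset against the
driver's real xdp_hints_union):

/* Hypothetical BPF-prog shadow of a driver's xdp_hints_union: declare
 * only the member(s) this prog cares about and let CO-RE/libbpf fix
 * up the offset against the driver's real union. */
union xdp_hints_union {
        struct {
                __u64 rx_timestamp;
        };
        /* other xdp_hints_* variants omitted */
} __attribute__((preserve_access_index));

An access like u->rx_timestamp would then be offset-adjusted by libbpf at
load time.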


[...]
>> > > >
>> > > > All we need here is for libbpf to, again, do xdp_rx_hints ->
>> > > > <device>_xdp_rx_hints translation before it evaluates
>> > > > bpf_core_field_exists()?
>> > > >
>> > > > Thoughts? Any downsides? Am I missing something?
>> > > >
>> > >
>> > > Well, the downside is primarily that this design limits innovation.
>> > >
>> > > Each time a NIC driver wants to introduce a new hardware hint, it has
>> > > to update the central UAPI xdp_rx_hints struct first.
>> > >
>> > > The design in the patchset is to be open for innovation.  Drivers can
>> > > extend their own xdp_hints_<driver>_xxx struct(s).  They still have to
>> > > land their patches upstream, but avoid mangling a central UAPI struct.
>> > > As upstream we review driver changes and should focus on sane struct
>> > > member naming(+size), especially if this "sounds" like a hint/feature
>> > > that more drivers are likely to support.  With help from BTF
>> > > relocations, a new driver can support the same hint/feature if
>> > > naming(+size) match (without necessarily the same offset in the struct).
>> >
>> > The opposite side of this approach is that we'll have 'ixgbe_hints'
>> > with 'rx_timestamp' and 'mvneta_hints' with something like
>> > 'rx_tstamp'.
> 
>> Well, as I wrote, reviewers should ask drivers to use the same member
>> name.
> 
> SG!
> 
>> > > > Also, about the TX side: I feel like the same can be applied there,
>> > > > the program works with xdp_tx_hints and libbpf will rewrite to
>> > > > <device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
>> > > > those, presumably, can be relocatable by libbpf as well?
>> > > >
>> > >
>> > > Good to think ahead for the TX-side, even though I think we should
>> > > focus on landing the RX-side first.
>> > >
>> > > I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
>> > > common struct xdp_hints_common, without a RX/TX direction indication.
>> > > Maybe this is wrong of me, but my thinking was that most of the common
>> > > hints can be directly used as TX-side hints.  I'm hoping TX-side
>> > > xdp-hints will need little-to-no adjustment before using the
>> > > hints as TX "instructions".  I'm hoping that XDP-redirect will just work
>> > > and the xmit driver can use the XDP-hints area.
>> > >
>> > > Please correct me if I'm wrong.
>> > > The checksum fields hopefully translate to similar TX offload "actions".
>> > > The VLAN offload hint should translate directly to the TX-side.
>> > >
>> > > I can easily be convinced we should name it xdp_hints_rx_common from the
>> > > start, but then I will propose that xdp_hints_tx_common have the
>> > > checksum and VLAN fields+flags at the same locations, such that we don't
>> > > take any performance hit for moving them to "TX-side" hints, making
>> > > XDP-redirect just work.
>> >
>> > Might be good to think about this beforehand. I agree that most of the
>> > layout should hopefully match. However, one case that I'm interested
>> > in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
>> > in the metadata; for tx, I'm merely setting a flag somewhere to
>> > request it for async delivery later (I hope we plan to support that
>> > for af_xdp?). So the layout might be completely different :-(
>> >
> 
>> Yes, it is definitely in my plans to support handling at TX-completion
>> time, so you can extract the TX-wire-timestamp.  This is easy for AF_XDP
>> as it has the CQ (Completion Queue) step.
> 
>> I'm getting ahead of myself, but for XDP I imagine that the driver will
>> populate this xdp_tx_hint in the DMA TX-completion function, and we can
>> add a kfunc "not-a-real-hook" to xdp_return_frame that can run another
>> XDP BPF-prog to inspect the xdp_tx_hint in the metadata.
> 
> Can we also place that xdp_tx_hint somewhere in the completion ring
> for AF_XDP to consume?

Yes, that is basically what I said above. This will be automatic/easy
for AF_XDP as it has the CQ (Completion Queue) ring.  The packets in the
completion ring will still contain the metadata area, which could have
been populated with the TX-wire-timestamp.
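
As a rough sketch of how an AF_XDP app might consume that from the
completion ring (the xsk_ring_cons__* calls are the standard AF_XDP
helpers; the metadata layout assumed here, 64-bit BTF ID last with a
TX-wire-timestamp just before it, is hypothetical):

#include <string.h>
#include <xdp/xsk.h>  /* libxdp helpers; formerly <bpf/xsk.h> */

/* Rough sketch: drain the completion ring and look at each completed
 * frame's metadata area, which sits just before packet data. */
static void drain_cq(struct xsk_ring_cons *cq, unsigned char *umem_area)
{
        __u32 idx = 0;
        __u32 n = xsk_ring_cons__peek(cq, 64, &idx);

        for (__u32 i = 0; i < n; i++) {
                __u64 addr = *xsk_ring_cons__comp_addr(cq, idx + i);
                unsigned char *data = xsk_umem__get_data(umem_area, addr);
                __u64 btf_full_id, tx_tstamp;

                memcpy(&btf_full_id, data - 8, sizeof(btf_full_id));
                memcpy(&tx_tstamp, data - 16, sizeof(tx_tstamp));
                /* match btf_full_id against known layouts before
                 * trusting tx_tstamp */
        }
        if (n)
                xsk_ring_cons__release(cq, n);
}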


>> At this proposed kfunc xdp_return_frame call point, we likely cannot know
>> which driver produced the xdp_hints metadata either, and thus cannot
>> lock our design or BTF-relocations to assume which driver it is loaded on.
> 
>> [... cut ... getting too long]

--Jesper
David Ahern Oct. 7, 2022, 3:05 p.m. UTC | #25
On 10/6/22 11:47 AM, Jesper Dangaard Brouer wrote:
> 
> 
>> I also feel like xdp_hints_common is
>> a bit distracting. It makes the common case easy and it hides the
>> discussion/complexity about per-device hints. Maybe we can drop this
>> common case at all? Why can't every driver have a custom hints struct?
>> If we agree that naming/size will be the same across them (and review
>> catches/guarantees that), why do we even care about having a common
>> xdp_hints_common struct?
> 
> The xdp_hints_common struct is a stepping stone to making this easily
> consumable from C-code that needs to generate SKBs and info for the
> virtio_net 'hdr' desc.
> 
> David Ahern has been begging me for years to just add this statically
> to xdp_frame.  I have been reluctant, because I think we can come up
> with a more flexible (less UAPI-fixed) way that allows both kernel-code
> and BPF-progs to access these fields.  I think of this approach as a
> compromise between these two users.
> 

Simple implementation for common - standard - networking features; jump
through hoops to use vendor-unique features. Isn't that the point of
standardization?

There are multiple use cases where VLAN and checksumming requests need
to traverse devices on an XDP redirect.
Martin KaFai Lau Oct. 11, 2022, 6:29 a.m. UTC | #26
On 10/6/22 8:29 AM, Jesper Dangaard Brouer wrote:
> 
> On 06/10/2022 11.14, Magnus Karlsson wrote:
>> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>>
>>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>>>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>>>>> parse it out from the pre-populated metadata?
>>>>>
>>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>>> and container_of() to get to its own stack with whatever fields
>>>>> it needs.
>>>>
>>>> Ack, seems like something worth exploring then.
>>>>
>>>> The only issue I see with that is that we'd probably have to extend
>>>> the loading api to pass target xdp device so we can pre-generate
>>>> per-device bytecode for those kfuncs?
>>>
>>> There is an existing attr->prog_ifindex for dev offload purposes.  Maybe
>>> we can re-purpose/re-use some of the offload API.  How this kfunc can be
>>> presented also needs some thought; could be a new ndo_xxx.... not sure.
>>>> And this potentially will block attaching the same program
>>>> to different drivers/devices?
>>>> Or, Martin, did you maybe have something better in mind?
>>>
>>> If the kfunc/helper is inline, then it will have to be per device.  Unless
>>> the bpf prog chooses not to inline, which could be an option, but I am
>>> also not sure how often the user wants to 'attach' a loaded xdp prog to a
>>> different device.  To some extent, the CO-RE hints-loading-code will have
>>> to be per device also, no?
>>>
>>> Why I asked about the kfunc/helper approach is because, from the set, it
>>> seems the hints are already available at the driver.  The specific
>>> knowledge the xdp prog is missing is how to get the hints from the
>>> rx_desc/rx_queue.  The straightforward way to me is to make them
>>> (rx_desc/rx_queue) available to the xdp prog and have a kfunc/helper to
>>> extract the hints from them only if the xdp prog needs it.  The xdp prog
>>> can selectively get what hints it needs and then optionally store them
>>> into the meta area in any layout.
>>
>> This sounds like a really good idea to me, well worth exploring. To
>> only have to pay, performance wise, for the metadata you actually use
>> is very important. I did some experiments [1] on the previous patch
>> set of Jesper's and there is substantial overhead added for each
>> metadata enabled (and fetched from the NIC). This is especially
>> important for AF_XDP in zero-copy mode where most packets are directed
>> to user-space (if not, you should be using the regular driver that is
>> optimized for passing packets to the stack or redirecting to other
>> devices). In this case, the user knows exactly what metadata it wants
>> and where in the metadata area it should be located in order to offer
>> the best performance for the application in question. But as you say,
>> your suggestion could potentially offer a good performance upside to
>> the regular XDP path too.

Yeah, since we are on this flexible hint layout, after reading the replies in
other threads, now I am also not sure why we need an xdp_hints_common;
probably I am missing something.  It seems to be most useful in
__xdp_build_skb_from_frame.  However, the xdp prog can also fill in the
xdp_hints_common by itself, only when needed, instead of having the driver
always fill it in.

> 
> Okay, let's revisit this again.  And let me explain why I believe this
> isn't going to fly.
> 
> That was also my initial thought: let's just give the XDP BPF-prog direct
> access to the NIC rx_descriptor, or have another BPF-prog populate the
> XDP-hints prior to calling the XDP-prog.  Going down this path
> (previously) I learned three things:
> 
> (1) Understanding/decoding the rx_descriptor requires access to the
> programmer's datasheet, because it is very compacted and the meaning of
> the bits depends on other bits plus the current configuration status of
> the HW.
> 
> (2) HW has bugs, and for certain chip revisions the driver will skip some
> offload hints.  Thus, chip revisions need to be exported to BPF-progs
> and handled appropriately.
> 
> (3) Sometimes the info is actually not available in the rx_descriptor.
> Often for HW timestamps, the timestamp needs to be read from a HW
> register.  How do we expose this to the BPF-prog?

hmm.... maybe I am missing those hw-specific details here.  How would the
driver handle the above cases and fill in the xdp_hints in the meta?  Can
the same code be called by the xdp prog?

> 
>> [1] 
>> https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/
> 
> 
> Notice that this patchset doesn't block this idea, as it is orthogonal.
> After we have established a way to express xdp_hints layouts via BTF,
> then we can still add a pre-XDP BPF-prog that populates the XDP-hints,
> and squeeze out more performance by skipping some of the offloads that
> your-specific-XDP-prog is not interested in.
> 
> --Jesper
>
Jesper Dangaard Brouer Oct. 11, 2022, 11:57 a.m. UTC | #27
On 11/10/2022 08.29, Martin KaFai Lau wrote:
> On 10/6/22 8:29 AM, Jesper Dangaard Brouer wrote:
>>
>> On 06/10/2022 11.14, Magnus Karlsson wrote:
>>> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau 
>>> <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>>>
>>>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>>>>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>>>>>> parse it out from the pre-populated metadata?
>>>>>>
>>>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>>>> and container_of() to get to its own stack with whatever fields
>>>>>> it needs.
>>>>>
>>>>> Ack, seems like something worth exploring then.
>>>>>
>>>>> The only issue I see with that is that we'd probably have to extend
>>>>> the loading api to pass target xdp device so we can pre-generate
>>>>> per-device bytecode for those kfuncs?
>>>>
>>>> There is an existing attr->prog_ifindex for dev offload purposes.
>>>> Maybe we can re-purpose/re-use some of the offload API.  How this
>>>> kfunc can be presented also needs some thought; could be a new
>>>> ndo_xxx.... not sure.
>>>>> And this potentially will block attaching the same program
>>>>> to different drivers/devices?
>>>>> Or, Martin, did you maybe have something better in mind?
>>>>
>>>> If the kfunc/helper is inline, then it will have to be per device.
>>>> Unless the bpf prog chooses not to inline, which could be an option,
>>>> but I am also not sure how often the user wants to 'attach' a loaded
>>>> xdp prog to a different device.
>>>> To some extent, the CO-RE hints-loading-code will have to be per
>>>> device also, no?
>>>>
>>>> Why I asked about the kfunc/helper approach is because, from the set,
>>>> it seems the hints are already available at the driver.  The specific
>>>> knowledge the xdp prog is missing is how to get the hints from the
>>>> rx_desc/rx_queue.  The straightforward way to me is to make them
>>>> (rx_desc/rx_queue) available to the xdp prog and have a kfunc/helper
>>>> to extract the hints from them only if the xdp prog needs it.  The
>>>> xdp prog can selectively get what hints it needs and then optionally
>>>> store them into the meta area in any layout.
>>>
>>> This sounds like a really good idea to me, well worth exploring. To
>>> only have to pay, performance wise, for the metadata you actually use
>>> is very important. I did some experiments [1] on the previous patch
>>> set of Jesper's and there is substantial overhead added for each
>>> metadata enabled (and fetched from the NIC). This is especially
>>> important for AF_XDP in zero-copy mode where most packets are directed
>>> to user-space (if not, you should be using the regular driver that is
>>> optimized for passing packets to the stack or redirecting to other
>>> devices). In this case, the user knows exactly what metadata it wants
>>> and where in the metadata area it should be located in order to offer
>>> the best performance for the application in question. But as you say,
>>> your suggestion could potentially offer a good performance upside to
>>> the regular XDP path too.
> 
> Yeah, since we are on this flexible hint layout, after reading the
> replies in other threads, now I am also not sure why we need an
> xdp_hints_common; probably I am missing something.  It seems to
> be most useful in __xdp_build_skb_from_frame. However, the xdp prog can
> also fill in the xdp_hints_common by itself, only when needed, instead
> of having the driver always fill it in.
> 

I *want* the XDP-hints to be populated even when no XDP-prog is running.
The xdp_frame *is* the mini-SKB concept. These XDP-hints are about
adding HW offload hints to this mini-SKB, to allow it to grow into a
full-SKB with these offloads.

I could add this purely as a netstack feature, by extending the xdp_frame
area with a common struct. For XDP-prog access I could extend xdp_md
with fields that get UAPI-rewrite mapped onto these fields. For
the AF_XDP users this data becomes harder to access; an XDP-prog
could (spending cycles) move these offloads into the metadata area, but
why not place them there in the first place.

I think the main point is that I don't see the XDP-prog as the primary
consumer of these hints.
One reason/use-case for letting an XDP-prog access these hints prior to
creating a full-SKB is to help fix up (or provide) offload hints.
The mvneta driver patch highlights this, as the HW has limited hints,
which an XDP-prog can provide prior to calling the netstack.
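
A minimal sketch of an XDP-prog providing such a hint itself, using the
existing bpf_xdp_adjust_meta() helper (the hint struct here is
illustrative, not a layout from the patchset):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative hint layout an XDP-prog could fill in itself when the
 * HW provides no such hint (as discussed for mvneta above). */
struct my_sw_hints {
        __u32 rx_hash32;
        __u32 pad;
        __u64 btf_full_id;  /* ID describing this layout */
};

SEC("xdp")
int xdp_provide_hint(struct xdp_md *ctx)
{
        struct my_sw_hints *h;
        void *data, *meta;

        /* Grow the metadata area in front of packet data */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*h)))
                return XDP_PASS;

        data = (void *)(long)ctx->data;
        meta = (void *)(long)ctx->data_meta;
        h = meta;
        if (meta + sizeof(*h) > data)  /* verifier bounds check */
                return XDP_PASS;

        h->rx_hash32 = 0xdeadbeef;  /* e.g. a SW-computed hash */
        /* h->btf_full_id = ...this prog's own layout ID...; */
        return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";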

In this patchset I'm trying to balance the different users. Via BTF
I'm trying hard not to create more UAPI (e.g. more fixed fields available
in xdp_md that we cannot get rid of), while adding driver flexibility
on top of the common struct.  This flexibility seems to be stalling the
patchset, as we haven't found the perfect way to express it (yet) given
that the BTF layout is per driver.


>>
>> Okay, let's revisit this again.  And let me explain why I believe this
>> isn't going to fly.
>>
>> That was also my initial thought: let's just give the XDP BPF-prog direct
>> access to the NIC rx_descriptor, or have another BPF-prog populate the
>> XDP-hints prior to calling the XDP-prog.  Going down this path
>> (previously) I learned three things:
>>
>> (1) Understanding/decoding the rx_descriptor requires access to the
>> programmer's datasheet, because it is very compacted and the meaning of
>> the bits depends on other bits plus the current configuration status of
>> the HW.
>>
>> (2) HW has bugs, and for certain chip revisions the driver will skip some
>> offload hints.  Thus, chip revisions need to be exported to BPF-progs
>> and handled appropriately.
>>
>> (3) Sometimes the info is actually not available in the rx_descriptor.
>> Often for HW timestamps, the timestamp needs to be read from a HW
>> register.  How do we expose this to the BPF-prog?
> 
> hmm.... maybe I am missing those hw-specific details here.  How would
> the driver handle the above cases and fill in the xdp_hints in the
> meta?  Can the same code be called by the xdp prog?
>

As I mentioned above, I want the XDP-hints to be populated even when no
XDP-prog is running. I don't want a dependency on loading an XDP-prog
to get the hints populated, as e.g. the netstack is one of the users.


>>
>> Notice that this patchset doesn't block this idea, as it is orthogonal.
>> After we have established a way to express xdp_hints layouts via BTF,
>> then we can still add a pre-XDP BPF-prog that populates the XDP-hints,
>> and squeeze out more performance by skipping some of the offloads that
>> your-specific-XDP-prog is not interested in.
>>
>> --Jesper
>>
>