Message ID: 166256538687.1434226.15760041133601409770.stgit@firesoul (mailing list archive)
Series: XDP-hints: XDP gaining access to HW offload hints via BTF
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 07 Sep 2022 17:45:00 +0200

> This patchset exposes the traditional hardware offload hints to XDP and
> relies on BTF to expose the layout to users.
>
> The main idea is that the kernel and NIC drivers simply define the
> struct layouts they choose to use for XDP-hints. These XDP-hints
> structs get naturally and automatically described via BTF and
> implicitly exported to users. NIC drivers populate and record their
> own BTF ID as the last member in the XDP metadata area (making it
> easily accessible by AF_XDP userspace at a known negative offset from
> packet data start).
>
> A naming convention for the structs (xdp_hints_*) is used such that
> userspace can find and decode the BTF layout and match against the
> provided BTF IDs. Thus, no new UAPI interfaces are needed for
> exporting what XDP-hints a driver supports.
>
> The patch "i40e: Add xdp_hints_union" introduces the idea of creating
> a union named "xdp_hints_union" in every driver, which contains all
> xdp_hints_* structs this driver can support. This makes it
> easier/quicker to find and parse the relevant BTF types. (Seeking
> input before fixing up all drivers in the patchset).
>
> The main differences from RFC-v1:
>  - Drop the idea of a BTF "origin" (vmlinux, module or local)
>  - Instead use a full 64-bit BTF ID that combines object+type ID
>
> I've taken some of Alexandr/Larysa's libbpf patches and integrated
> those.

Not sure if it's okay to inform the authors about that fact only after
sending? Esp from the eeeh... "incompatible" implementation? I realize
it's open code, but this looks sorta deprecating.

> Patchset exceeds netdev's usual max-15-patches rule. My excuse is that
> three NIC drivers (i40e, ixgbe and mvneta) get XDP-hints support,
> which required some refactoring to remove the SKB dependencies.
> ---
>
> Jesper Dangaard Brouer (10):
>       net: create xdp_hints_common and set functions
>       net: add net_device feature flag for XDP-hints
>       xdp: controlling XDP-hints from BPF-prog via helper
>       i40e: Refactor i40e_ptp_rx_hwtstamp
>       i40e: refactor i40e_rx_checksum with helper
>       bpf: export btf functions for modules
>       btf: Add helper for kernel modules to lookup full BTF ID
>       i40e: add XDP-hints handling
>       net: use XDP-hints in xdp_frame to SKB conversion
>       i40e: Add xdp_hints_union
>
> Larysa Zaremba (3):
>       libbpf: factor out BTF loading from load_module_btfs()
>       libbpf: try to load vmlinux BTF from the kernel first
>       libbpf: patch module BTF obj+type ID into BPF insns
>
> Lorenzo Bianconi (1):
>       mvneta: add XDP-hints support
>
> Maryam Tahhan (4):
>       ixgbe: enable xdp-hints
>       ixgbe: add rx timestamp xdp hints support
>       xsk: AF_XDP xdp-hints support in desc options
>       ixgbe: AF_XDP xdp-hints processing in ixgbe_clean_rx_irq_zc
>
>
>  drivers/net/ethernet/intel/i40e/i40e.h        |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  22 ++
>  drivers/net/ethernet/intel/i40e/i40e_ptp.c    |  36 ++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 252 ++++++++++++++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   5 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 217 +++++++++++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c  |  82 ++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |   2 +
>  drivers/net/ethernet/marvell/mvneta.c         |  59 +++-
>  include/linux/btf.h                           |   3 +
>  include/linux/netdev_features.h               |   3 +-
>  include/net/xdp.h                             | 256 +++++++++++++++++-
>  include/uapi/linux/bpf.h                      |  35 +++
>  include/uapi/linux/if_xdp.h                   |   2 +-
>  kernel/bpf/btf.c                              |  36 ++-
>  net/core/filter.c                             |  52 ++++
>  net/core/xdp.c                                |  22 +-
>  net/ethtool/common.c                          |   1 +
>  net/xdp/xsk.c                                 |   2 +-
>  net/xdp/xsk_queue.h                           |   3 +-
>  tools/lib/bpf/bpf_core_read.h                 |   3 +-
>  tools/lib/bpf/btf.c                           | 142 +++++++++-
>  tools/lib/bpf/libbpf.c                        |  52 +---
>  tools/lib/bpf/libbpf_internal.h               |   7 +-
>  tools/lib/bpf/relo_core.c                     |   8 +-
>  tools/lib/bpf/relo_core.h                     |   1 +
>  26 files changed, 1127 insertions(+), 177 deletions(-)

--
Olek
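For illustration, the layout the cover letter describes could look
roughly like this -- a minimal sketch, assuming hypothetical member
names (these are not the patchset's exact structs; assumes
<linux/types.h>):

    /* Common hints, with the combined 64-bit BTF ID as the very last
     * member so it ends exactly where packet data begins.
     */
    struct xdp_hints_common {
            __u32 xdp_hints_flags;   /* validity bits for fields below */
            __u16 vlan_tci;
            __u16 rx_csum_status;
            __u64 btf_full_id;       /* object+type BTF ID, always last */
    };

    /* A driver-specific extension simply embeds the common struct at
     * its end, keeping btf_full_id at data - sizeof(__u64).
     */
    struct xdp_hints_example_driver {
            __u64 rx_timestamp;
            struct xdp_hints_common common;
    };

This is what lets AF_XDP userspace find the BTF ID at a fixed negative
offset without any new UAPI.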
On 08/09/2022 11.30, Alexander Lobakin wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 07 Sep 2022 17:45:00 +0200
>
>> This patchset exposes the traditional hardware offload hints to XDP and
>> relies on BTF to expose the layout to users.
>>
[...]
>> The main differences from RFC-v1:
>>  - Drop the idea of a BTF "origin" (vmlinux, module or local)
>>  - Instead use a full 64-bit BTF ID that combines object+type ID
>>
>> I've taken some of Alexandr/Larysa's libbpf patches and integrated
>> those.
>
> Not sure if it's okay to inform the authors about that fact only
> after sending? Esp from the eeeh... "incompatible" implementation?

Just to be clear: I have made sure that the developers of the patches
retain authorship (when applied to git via the From: line) and I've
Cc'ed the developers directly.

I didn't Cc you directly as I knew you would be included via the
XDP-hints list, and I didn't directly use one of your patches.

> I realize it's open code, but this looks sorta deprecating.

After discussions with Larysa on the pre-patchset, I was convinced of
the idea of a full 64-bit BTF ID. Thus, I took those patches and
carried them in my patchset, instead of reimplementing the same
myself. Precisely out of respect for Larysa's work, as I wanted to
give her credit for coding this.

I'm very interested in collaborating. That is why I have picked up
patches from your patchset and am carrying them forward. I could just
as easily have reimplemented them myself.

--Jesper
On 09/07, Jesper Dangaard Brouer wrote:
> This patchset exposes the traditional hardware offload hints to XDP and
> relies on BTF to expose the layout to users.
>
[...]

Hey Jesper,

I took a quick look at the series. Do we really need the enum with the
flags? We might eventually hit that "first 16 bits are reserved" issue?

Instead of exposing an enum with the flags, why not solve it as follows:

a. We define UAPI struct xdp_rx_hints with _all_ possible hints
b. Each device defines a much denser <device>_xdp_rx_hints struct with
   the metadata that it supports
c. The subset of fields in <device>_xdp_rx_hints should match the ones
   from xdp_rx_hints (we essentially standardize on the field
   names/sizes)
d. We expose the <device>_xdp_rx_hints btf id via netlink for each
   device
e. libbpf will query and do offset relocations for
   xdp_rx_hints -> <device>_xdp_rx_hints at load time

Would that work? Then it seems like we can replace bitfields with the
following:

  if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
          /* use that hint */
  }

All we need here is for libbpf to, again, do the xdp_rx_hints ->
<device>_xdp_rx_hints translation before it evaluates
bpf_core_field_exists()?

Thoughts? Any downsides? Am I missing something?

Also, about the TX side: I feel like the same can be applied there;
the program works with xdp_tx_hints and libbpf will rewrite to
<device>_xdp_tx_hints. xdp_tx_hints might have fields like
"has_tx_vlan:1"; those, presumably, can be relocatable by libbpf as
well?
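A compilable sketch of the program side of this proposal (struct
xdp_rx_hints here is Stanislav's proposed fake-UAPI struct, not an
existing kernel type; field names are assumptions for illustration):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    /* Fake-UAPI view; libbpf would relocate these accesses to the
     * device's own <device>_xdp_rx_hints layout at load time.
     */
    struct xdp_rx_hints {
            __u32 rx_hash;
            __u16 vlan_tci;
    } __attribute__((preserve_access_index));

    SEC("xdp")
    int rx_hints_example(struct xdp_md *ctx)
    {
            void *data_meta = (void *)(long)ctx->data_meta;
            void *data = (void *)(long)ctx->data;
            struct xdp_rx_hints *hints = data_meta;

            if ((void *)(hints + 1) > data)  /* verifier bounds check */
                    return XDP_PASS;

            if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
                    /* branch removed at load time if unsupported */
                    bpf_printk("vlan_tci %u", hints->vlan_tci);
            }
            return XDP_PASS;
    }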
On 04/10/2022 01.55, sdf@google.com wrote:
> On 09/07, Jesper Dangaard Brouer wrote:
>> This patchset exposes the traditional hardware offload hints to XDP and
>> relies on BTF to expose the layout to users.
>>
[...]
>
> Hey Jesper,
>
> I took a quick look at the series.

Appreciate that! :-)

> Do we really need the enum with the flags?

The primary reason for using an enum is that it gets exposed as BTF.
The proposal is that userspace needs to obtain the flags via BTF, such
that they don't become UAPI, but something we can change later.

> We might eventually hit that "first 16 bits are reserved" issue?
>
> Instead of exposing an enum with the flags, why not solve it as follows:
> a. We define UAPI struct xdp_rx_hints with _all_ possible hints

How can we know _all_ possible hints from the beginning(?).

A UAPI + central struct dictating all possible hints will limit
innovation.

> b. Each device defines a much denser <device>_xdp_rx_hints struct with
>    the metadata that it supports

Thus, the NIC device is limited to what is defined in the UAPI struct
xdp_rx_hints. Again this limits innovation.

> c. The subset of fields in <device>_xdp_rx_hints should match the ones
>    from xdp_rx_hints (we essentially standardize on the field
>    names/sizes)
> d. We expose the <device>_xdp_rx_hints btf id via netlink for each
>    device

For this proposed design you would still need more than one BTF ID or
<device>_xdp_rx_hints struct, because not all packets contain all
hints. The most common case is HW timestamping, which some HW only
supports for PTP frames.

Plus, I don't see a need to expose anything via netlink, as we can
just use the existing BTF information from the module. Thus, we avoid
creating more UAPI.

> e. libbpf will query and do offset relocations for
>    xdp_rx_hints -> <device>_xdp_rx_hints at load time
>
> Would that work? Then it seems like we can replace bitfields with the

I used to be a fan of bitfields, until I discovered that they are bad
for performance, because compilers cannot optimize them.
> following:
>
>   if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
>           /* use that hint */

Fairly often a VLAN will not be set in packets, so we still have to
read and check a bitfield/flag to see if the VLAN value is valid.
(Guess it is implicit in the above code).

>   }
>
> All we need here is for libbpf to, again, do the xdp_rx_hints ->
> <device>_xdp_rx_hints translation before it evaluates
> bpf_core_field_exists()?
>
> Thoughts? Any downsides? Am I missing something?

Well, the downside is primarily that this design limits innovation.

Each time a NIC driver wants to introduce a new hardware hint, it has
to update the central UAPI xdp_rx_hints struct first.

The design in the patchset is open for innovation. Drivers can extend
their own xdp_hints_<driver>_xxx struct(s). They still have to land
their patches upstream, but avoid mangling a central UAPI struct. As
upstream we review driver changes and should focus on sane struct
member naming(+size), especially if this "sounds" like a hint/feature
that more drivers are likely to support. With help from BTF
relocations, a new driver can support the same hint/feature if the
naming(+size) matches (without necessarily the same offset in the
struct).

> Also, about the TX side: I feel like the same can be applied there;
> the program works with xdp_tx_hints and libbpf will rewrite to
> <device>_xdp_tx_hints. xdp_tx_hints might have fields like
> "has_tx_vlan:1"; those, presumably, can be relocatable by libbpf as
> well?

Good to think ahead for the TX-side, even though I think we should
focus on landing the RX-side first.

I notice your naming xdp_rx_hints vs. xdp_tx_hints. I have named the
common struct xdp_hints_common, without a RX/TX direction indication.
Maybe this is wrong of me, but my thinking was that most of the common
hints can be directly used as TX-side hints. I'm hoping TX-side
xdp-hints will need little-to-no adjustment before using the hints as
TX "instructions". I'm hoping that XDP-redirect will just work and the
xmit driver can use the XDP-hints area.

Please correct me if I'm wrong.
The checksum fields hopefully translate to similar TX offload "actions".
The VLAN offload hint should translate directly to the TX-side.

I can easily be convinced we should name it xdp_hints_rx_common from
the start, but then I will propose that xdp_hints_tx_common have the
checksum and VLAN fields+flags at the same locations, such that we
don't take any performance hit from moving them to "TX-side" hints,
making XDP-redirect just work.

--Jesper
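The flags approach Jesper describes could look roughly like this,
reusing the illustrative xdp_hints_common sketch from earlier (the
flag bit is hypothetical; assumes the BPF includes from the previous
sketch):

    #define HINT_FLAG_VLAN_VALID (1U << 3)  /* hypothetical flag bit */

    static __always_inline int example_read_vlan(struct xdp_md *ctx,
                                                 __u16 *tci)
    {
            void *data_meta = (void *)(long)ctx->data_meta;
            void *data = (void *)(long)ctx->data;
            struct xdp_hints_common *hints = data_meta;

            if ((void *)(hints + 1) > data)
                    return -1;  /* no hints metadata present */
            if (!(hints->xdp_hints_flags & HINT_FLAG_VLAN_VALID))
                    return -1;  /* field not populated for this packet */
            *tci = hints->vlan_tci;  /* read only when flagged valid */
            return 0;
    }

The point of the flag word is that the driver never has to zero-fill
unused fields; it only sets bits for hints it actually wrote.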
On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
> On 04/10/2022 01.55, sdf@google.com wrote:
> > On 09/07, Jesper Dangaard Brouer wrote:
> >> This patchset exposes the traditional hardware offload hints to XDP and
> >> relies on BTF to expose the layout to users.
> >>
[...]
>
> > Do we really need the enum with the flags?
>
> The primary reason for using an enum is that it gets exposed as BTF.
> The proposal is that userspace needs to obtain the flags via BTF, such
> that they don't become UAPI, but something we can change later.
>
> > We might eventually hit that "first 16 bits are reserved" issue?
> >
> > Instead of exposing an enum with the flags, why not solve it as follows:
> > a. We define UAPI struct xdp_rx_hints with _all_ possible hints
>
> How can we know _all_ possible hints from the beginning(?).
>
> A UAPI + central struct dictating all possible hints will limit
> innovation.

We don't need to know them all in advance, the same way we don't know
them all for the flags enum. That UAPI xdp_rx_hints can be extended
any time some driver needs some new hint offload. The benefit here is
that we have a "common registry" of all offloads, and different
drivers have an opportunity to share.

Think of it like the current __sk_buff vs sk_buff. xdp_rx_hints is a
fake UAPI struct (__sk_buff) and the access to it gets translated into
<device>_xdp_rx_hints offsets (sk_buff).

> > b. Each device defines a much denser <device>_xdp_rx_hints struct
> >    with the metadata that it supports
>
> Thus, the NIC device is limited to what is defined in the UAPI struct
> xdp_rx_hints. Again this limits innovation.

I guess what I'm missing from your series is the bpf/userspace side.
Do you have an example on the bpf side that will work for, say,
xdp_hints_ixgbe_timestamp?
Suppose you pass this custom hints btf_id via xdp_md as proposed;
what's the action on the bpf side to consume this?

  if (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id
      /* supposedly populated at runtime by libbpf? */) {
          // do something with rx_timestamp
          // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
  } else if (ctx_hints_btf_id == xdp_hints_ixgbe_btf_id) {
          // do something else
          // plus explicitly handle xdp_hints_common here?
  } else {
          // handle xdp_hints_common
  }

What I'd like to avoid is an xdp program targeting specific drivers.
Where possible, we should aim towards something like "if this device
has rx_timestamp offload -> use it", without depending too much on
specific btf_ids.

> > c. The subset of fields in <device>_xdp_rx_hints should match the
> >    ones from xdp_rx_hints (we essentially standardize on the field
> >    names/sizes)
> > d. We expose the <device>_xdp_rx_hints btf id via netlink for each
> >    device
>
> For this proposed design you would still need more than one BTF ID or
> <device>_xdp_rx_hints struct, because not all packets contain all
> hints. The most common case is HW timestamping, which some HW only
> supports for PTP frames.
>
> Plus, I don't see a need to expose anything via netlink, as we can
> just use the existing BTF information from the module. Thus, we avoid
> creating more UAPI.

See above. I think even with your series, that btf_id info should also
come via netlink so the programs can query it before loading and do
the required adjustments. Otherwise, I'm not sure I understand what I
need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
late? I need to know them in advance to at least populate those ids
into the bpf program itself?

> > e. libbpf will query and do offset relocations for
> >    xdp_rx_hints -> <device>_xdp_rx_hints at load time
> >
> > Would that work? Then it seems like we can replace bitfields with the
>
> I used to be a fan of bitfields, until I discovered that they are bad
> for performance, because compilers cannot optimize them.

Ack, good point, something to keep in mind.

> > following:
> >
> >   if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
> >           /* use that hint */
>
> Fairly often a VLAN will not be set in packets, so we still have to
> read and check a bitfield/flag to see if the VLAN value is valid.
> (Guess it is implicit in the above code).

That's a fair point. Then we need two signals?

1. Whether this particular offload is supported for the device at all
   (via that bpf_core_field_exists or something similar)
2. Whether this particular packet has particular metadata (via your
   proposed flags)

  if (device I'm attaching xdp to has vlan offload) {
          /* via bpf_core_field_exists? */
          if (particular packet comes with a vlan tag) {
                  /* via your proposed bitfield flags? */
          }
  }

Or are we assuming that (2) is fast enough and we don't care about
(1)? Because (1) can 'if (0)' the whole branch and make the verifier
remove that part.

> >   }
> >
> > All we need here is for libbpf to, again, do the xdp_rx_hints ->
> > <device>_xdp_rx_hints translation before it evaluates
> > bpf_core_field_exists()?
> >
> > Thoughts? Any downsides? Am I missing something?
>
> Well, the downside is primarily that this design limits innovation.
>
> Each time a NIC driver wants to introduce a new hardware hint, it has
> to update the central UAPI xdp_rx_hints struct first.
>
> The design in the patchset is open for innovation. Drivers can extend
> their own xdp_hints_<driver>_xxx struct(s). They still have to land
> their patches upstream, but avoid mangling a central UAPI struct.
> As upstream we review driver changes and should focus on sane struct
> member naming(+size), especially if this "sounds" like a hint/feature
> that more drivers are likely to support. With help from BTF
> relocations, a new driver can support the same hint/feature if the
> naming(+size) matches (without necessarily the same offset in the
> struct).

The opposite side of this approach is that we'll have 'ixgbe_hints'
with 'rx_timestamp' and 'mvneta_hints' with something like 'rx_tstamp'.

> > Also, about the TX side: I feel like the same can be applied there;
> > the program works with xdp_tx_hints and libbpf will rewrite to
> > <device>_xdp_tx_hints. xdp_tx_hints might have fields like
> > "has_tx_vlan:1"; those, presumably, can be relocatable by libbpf as
> > well?
>
> Good to think ahead for the TX-side, even though I think we should
> focus on landing the RX-side first.
>
[...]
>
> I can easily be convinced we should name it xdp_hints_rx_common from
> the start, but then I will propose that xdp_hints_tx_common have the
> checksum and VLAN fields+flags at the same locations, such that we
> don't take any performance hit from moving them to "TX-side" hints,
> making XDP-redirect just work.

Might be good to think about this beforehand. I agree that most of the
layout should hopefully match. However, one case that I'm interested
in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
in the metadata; for tx, I'm merely setting a flag somewhere to
request it for async delivery later (I hope we plan to support that
for af_xdp?). So the layout might be completely different :-(
On 10/4/22 11:26 AM, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 04/10/2022 01.55, sdf@google.com wrote:
>>> On 09/07, Jesper Dangaard Brouer wrote:
>>>> This patchset exposes the traditional hardware offload hints to XDP and
>>>> relies on BTF to expose the layout to users.
>>>>
[...]
>
> I guess what I'm missing from your series is the bpf/userspace side.
> Do you have an example on the bpf side that will work for, say,
> xdp_hints_ixgbe_timestamp?

+1. A selftest is useful.
> Suppose you pass this custom hints btf_id via xdp_md as proposed;
> what's the action on the bpf side to consume this?
>
[...]
>
> What I'd like to avoid is an xdp program targeting specific drivers.
> Where possible, we should aim towards something like "if this device
> has rx_timestamp offload -> use it", without depending too much on
> specific btf_ids.

It would be my preference also, if it can avoid btf_id comparisons for
a specific driver like the above and let libbpf CO-RE handle the
matching/relocation.

For the rx hwtimestamp, the value could just be 0 if a specific
hw/driver cannot provide it for all packets while some other hw can.

An intentionally wild question: what does it take for the driver to
return the hints? Are the rx_desc and rx_queue enough? When the xdp
prog is calling a kfunc/bpf-helper, like
'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver replace it with
some inline bpf code (like how the inline code is generated for the
map_lookup helper)? The xdp prog can then store the hwtstamp in the
meta area in any layout it wants.
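From the program side, Martin's idea could look something like this
sketch (bpf_xdp_get_hwtstamp is his hypothetical name, not an existing
kernel kfunc):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical kfunc; the driver would inline its implementation,
     * reading its own rx descriptor.
     */
    extern __u64 bpf_xdp_get_hwtstamp(struct xdp_md *ctx) __ksym;

    SEC("xdp")
    int store_hwtstamp(struct xdp_md *ctx)
    {
            __u64 *meta;

            /* Reserve 8 bytes of metadata in front of packet data. */
            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u64)))
                    return XDP_PASS;

            meta = (void *)(long)ctx->data_meta;
            if ((void *)(meta + 1) > (void *)(long)ctx->data)
                    return XDP_PASS;  /* keep the verifier happy */

            /* The prog, not the driver, picks the metadata layout. */
            *meta = bpf_xdp_get_hwtstamp(ctx);
            return XDP_PASS;
    }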
On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> An intentionally wild question: what does it take for the driver to
> return the hints? Are the rx_desc and rx_queue enough? When the xdp
> prog is calling a kfunc/bpf-helper, like
> 'hwtstamp = bpf_xdp_get_hwtstamp()', can the driver replace it with
> some inline bpf code (like how the inline code is generated for the
> map_lookup helper)? The xdp prog can then store the hwtstamp in the
> meta area in any layout it wants.

Since you mentioned it... FWIW that was always my preference rather
than the BTF magic :) The jited image would have to be per-driver like
we do for BPF offload, but that's easy to do from the technical
perspective (I doubt many deployments bind the same prog to multiple
HW devices).
On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> > An intentionally wild question: what does it take for the driver to
> > return the hints? [...]
>
> Since you mentioned it... FWIW that was always my preference rather
> than the BTF magic :) The jited image would have to be per-driver like
> we do for BPF offload, but that's easy to do from the technical
> perspective (I doubt many deployments bind the same prog to multiple
> HW devices).

+1, sounds like a good alternative (got your reply while typing).
I'm not too versed in the rx_desc/rx_queue area, but it seems like,
worst case, bpf_xdp_get_hwtstamp can probably receive an xdp_md ctx
and parse it out from the pre-populated metadata?

Btw, do we also need to think about the redirect case? What happens
when I redirect one frame from a device A with one metadata format to
a device B with another?
On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
> +1, sounds like a good alternative (got your reply while typing).
> I'm not too versed in the rx_desc/rx_queue area, but it seems like,
> worst case, bpf_xdp_get_hwtstamp can probably receive an xdp_md ctx
> and parse it out from the pre-populated metadata?

I'd think so; worst case the driver can put xdp_md into a struct
and container_of() to get to its own stack with whatever fields
it needs.

> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

If there is a program on Tx then it'd be trivial - just do the
info <-> descriptor translation in the opposite direction than Rx.
TBH I'm not sure how it'd be done in the current approach, either.

Now I've questioned the BTF way and mentioned the Tx-side program in
a single thread; I better stop talking...
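The container_of() pattern Jakub mentions is a standard kernel idiom;
a sketch with made-up driver type and helper names (only xdp_buff and
container_of are real kernel APIs here):

    #include <linux/container_of.h>
    #include <net/xdp.h>

    /* The driver wraps the xdp_buff it hands to the XDP program, so a
     * per-driver inlined kfunc can recover driver state from it.
     */
    struct example_driver_xdp_buff {
            struct xdp_buff xdp;             /* passed to the XDP prog */
            union example_rx_desc *rx_desc;  /* HW descriptor, made up */
            struct example_rx_ring *rx_ring; /* made-up ring state */
    };

    static u64 example_xdp_get_hwtstamp(struct xdp_buff *xdp)
    {
            struct example_driver_xdp_buff *ctx =
                    container_of(xdp, struct example_driver_xdp_buff, xdp);

            /* hypothetical helper converting the descriptor timestamp */
            return example_rx_desc_to_ns(ctx->rx_desc);
    }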
On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
> > +1, sounds like a good alternative (got your reply while typing).
[...]
>
> I'd think so; worst case the driver can put xdp_md into a struct
> and container_of() to get to its own stack with whatever fields
> it needs.

Ack, seems like something worth exploring then. The only issue I see
with that is that we'd probably have to extend the loading API to pass
the target xdp device, so we can pre-generate per-device bytecode for
those kfuncs? And this potentially will block attaching the same
program to different drivers/devices? Or, Martin, did you maybe have
something better in mind?

> > Btw, do we also need to think about the redirect case? What happens
> > when I redirect one frame from a device A with one metadata format to
> > a device B with another?
>
> If there is a program on Tx then it'd be trivial - just do the
> info <-> descriptor translation in the opposite direction than Rx.
> TBH I'm not sure how it'd be done in the current approach, either.

Yeah, I don't think it magically works in any case. I'm just trying to
understand whether it's something we care to support out of the box or
can punt to the bpf programs themselves and say "if you care about
forwarding metadata, somehow agree on the format yourself".

> Now I've questioned the BTF way and mentioned the Tx-side program in
> a single thread; I better stop talking...

Forget about btf, hail to the new king - kfunc :-D
Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>> > An intentionally wild question: what does it take for the driver to
>> > return the hints? [...]
>>
>> Since you mentioned it... FWIW that was always my preference rather
>> than the BTF magic :) The jited image would have to be per-driver like
>> we do for BPF offload, but that's easy to do from the technical
>> perspective (I doubt many deployments bind the same prog to multiple
>> HW devices).
>
> +1, sounds like a good alternative (got your reply while typing).
> I'm not too versed in the rx_desc/rx_queue area, but it seems like,
> worst case, bpf_xdp_get_hwtstamp can probably receive an xdp_md ctx
> and parse it out from the pre-populated metadata?
>
> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

Yes, we absolutely do! In fact, to me this (redirects) is the main
reason why we need the ID in the packet in the first place: when
running on (say) a veth, an XDP program needs to be able to deal with
packets from multiple physical NICs.

As far as the API is concerned, my hope was that we could solve this
with a CO-RE-like approach where the program author just writes
something like:

  hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);

and bpf_get_xdp_hint() is really a macro (or a special kind of
relocation?) and libbpf would do the following on load:

- query the kernel BTF for all possible xdp_hint structs
- figure out which of them have an 'u64 hw_tstamp' member
- generate the necessary conditionals / jump table to disambiguate on
  the BTF_ID in the packet

Now, if this is better done by a kfunc I'm not terribly opposed to
that either, but I'm not sure it's actually better/easier to do in the
kernel than in libbpf at load time?

-Toke
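A hand-written approximation of what such a macro could expand to
after libbpf's load-time BTF query -- purely illustrative: the
BTF_FULL_ID_* constants and struct names are placeholders that libbpf
would patch in, and verifier bounds checks are omitted:

    __u64 btf_full_id = *(__u64 *)(data - sizeof(__u64)); /* ID ends the metadata */
    __u64 hw_tstamp = 0;

    if (btf_full_id == BTF_FULL_ID_XDP_HINTS_I40E_TS) {
            struct xdp_hints_i40e_timestamp *h =
                    data - sizeof(*h);  /* hints end where data starts */
            hw_tstamp = h->rx_timestamp;
    } else if (btf_full_id == BTF_FULL_ID_XDP_HINTS_IXGBE_TS) {
            struct xdp_hints_ixgbe_timestamp *h = data - sizeof(*h);
            hw_tstamp = h->rx_timestamp;
    }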
On 04-Oct-22 10:29 AM, Jesper Dangaard Brouer wrote:
>
> On 04/10/2022 01.55, sdf@google.com wrote:
>> On 09/07, Jesper Dangaard Brouer wrote:
>>> This patchset exposes the traditional hardware offload hints to XDP and
>>> relies on BTF to expose the layout to users.
>>>
[...]
> Good to think ahead for the TX-side, even though I think we should
> focus on landing the RX-side first.
>
[...]
>
> Please correct me if I'm wrong.
> The checksum fields hopefully translate to similar TX offload "actions".
> The VLAN offload hint should translate directly to the TX-side.

Like I indicated in another response, not necessarily. Rx checksum
typically indicates that the checksumming was completed and the
checksum was good/bad, but for Tx we actually supply offsets (possibly
multiple ones, depending on the L2/L3/L4 packet, plus there's also a
need to distinguish between packet types, as different NICs will have
different offload bits for different ptypes) in the metadata. So,
while VLAN offload may or may not translate directly to the Tx side of
things, checksumming probably won't.

[...]
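Maryam's Rx-result vs Tx-instruction distinction could be pictured
like this (illustrative structs only, not from the patchset; the Tx
fields mirror the skb csum_start/csum_offset convention):

    /* Rx: a verdict the HW already computed. */
    struct xdp_hints_rx_csum {
            __u32 csum_status;   /* none / unnecessary / complete */
    };

    /* Tx: instructions telling the HW what to checksum. */
    struct xdp_hints_tx_csum {
            __u16 csum_start;    /* offset where checksumming starts */
            __u16 csum_offset;   /* where to store the result */
            __u8  ptype;         /* packet type selects HW offload bits */
    };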
On 05/10/2022 02.25, Martin KaFai Lau wrote:
> For the rx hwtimestamp, the value could just be 0 if a specific
> hw/driver cannot provide it for all packets while some other hw can.

Keep in mind that we want to avoid having to write a (64-bit) zero
into the metadata for rx_hwtimestamp for every packet that doesn't
carry a timestamp. That essentially reverts back to clearing memory
like with SKBs; due to the performance overhead we don't want to go
down that path again!

There are multiple ways to avoid having to zero-init the memory.

In this patchset I have chosen the traditional approach of a flags
(u32) field located in xdp_hints_common, e.g. setting a flag if a
field is valid (p.s. John Fastabend convinced me of this approach
;-)). But COMBINED with: some BTF ID layouts don't contain some
fields, e.g. the rx_timestamp, thus the code has no reason to query
those flag fields.

I am intrigued to find a way to leverage bpf_core_field_exists() some
more (as proposed by Stanislav), as this can allow for dead-code
elimination.

--Jesper
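Combining the two mechanisms might look roughly like this sketch,
reusing the illustrative structs from the earlier notes (the flag bit
and layout names are hypothetical; assumes <bpf/bpf_core_read.h>):

    /* Hypothetical layout that carries a timestamp; smaller layouts
     * without the member let CO-RE remove the whole branch at load time.
     */
    struct xdp_hints_example_ts {
            __u64 rx_timestamp;
            struct xdp_hints_common common;
    } __attribute__((preserve_access_index));

    #define HINT_FLAG_RX_TSTAMP_VALID (1U << 4)  /* hypothetical bit */

    static __always_inline __u64
    example_read_rx_tstamp(struct xdp_hints_example_ts *h)
    {
            /* Compiled away entirely if the kernel layout lacks it. */
            if (!bpf_core_field_exists(struct xdp_hints_example_ts,
                                       rx_timestamp))
                    return 0;
            /* Per-packet validity still comes from the flags word. */
            if (!(h->common.xdp_hints_flags & HINT_FLAG_RX_TSTAMP_VALID))
                    return 0;
            return h->rx_timestamp;
    }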
On 05/10/2022 03.02, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>>> An intentionally wild question: what does it take for the driver to
>>> return the hints? [...]
>>
>> Since you mentioned it... FWIW that was always my preference rather
>> than the BTF magic :) The jited image would have to be per-driver like
>> we do for BPF offload, but that's easy to do from the technical
>> perspective (I doubt many deployments bind the same prog to multiple
>> HW devices).

On the technical side we do have the ifindex that can be passed along,
which is currently used for getting XDP hardware offloading to work.
But last time I tried this, I failed due to BPF tail call maps. (It's
not going to fly for other reasons, see redirect below).

> +1, sounds like a good alternative (got your reply while typing).
> I'm not too versed in the rx_desc/rx_queue area, but it seems like,
> worst case, bpf_xdp_get_hwtstamp can probably receive an xdp_md ctx
> and parse it out from the pre-populated metadata?
>
> Btw, do we also need to think about the redirect case? What happens
> when I redirect one frame from a device A with one metadata format to
> a device B with another?

Exactly the problem. With XDP-redirect the "remote" target device also
needs to interpret this metadata layout. For the RX-side we have the
immediate case of redirecting into a veth device. For the future
TX-side this is likely the same kind of issue, but I hope that if we
can solve this for the veth redirect use-case, it will keep us future
proof.

For the veth use-case I hope that we can use the same trick as
bpf_core_field_exists() to do dead-code elimination based on whether a
device driver is loaded on the system, like this pseudo code:

 if (bpf_core_type_id_kernel(struct xdp_hints_i40e_timestamp)) {
     /* check id + extract timestamp */
 }
 if (bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)) {
     /* check id + extract timestamp */
 }

If the given device driver doesn't exist on the system, I assume
bpf_core_type_id_kernel() will return 0 at libbpf
relocation/load-time, and thus this should cause dead-code
elimination. Should work today AFAIK?

--Jesper
On 04/10/2022 20.26, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 04/10/2022 01.55, sdf@google.com wrote:
>>> On 09/07, Jesper Dangaard Brouer wrote:
>>>> This patchset exposes the traditional hardware offload hints to XDP and
>>>> relies on BTF to expose the layout to users.
>>>>
[...]
>
> I guess what I'm missing from your series is the bpf/userspace side.
> Do you have an example on the bpf side that will work for, say,
> xdp_hints_ixgbe_timestamp?
> Suppose you pass this custom hints btf_id via xdp_md as proposed;

I just want to reiterate why we place btf_full_id at the "end inline".
This makes it easily available for AF_XDP to consume. Plus, we already
have to write info into this metadata cache-line anyway, thus it's
almost free.

Moving btf_full_id into xdp_md would require expanding both xdp_buff
and xdp_frame (+ an extra store when converting buff-to-frame). If
AF_XDP needs this btf_full_id, the BPF-prog _could_ move/copy it from
xdp_md to metadata, but that would just waste cycles; why not just
store it once in a known location (see the userspace sketch at the end
of this mail).

One option, for convenience, would be to map xdp_md->btf_full_id to
load the btf_full_id value from the metadata. But that would
essentially be syntax sugar and adds UAPI.

> what's the action on the bpf side to consume this?
>
>   if (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id
>       /* supposedly populated at runtime by libbpf? */) {

See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp).
AFAIK libbpf will make this a constant at load/setup time, and give us
dead-code elimination.

>           // do something with rx_timestamp
>           // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
>   } else if (ctx_hints_btf_id == xdp_hints_ixgbe_btf_id) {
>           // do something else
>           // plus explicitly handle xdp_hints_common here?
>   } else {
>           // handle xdp_hints_common
>   }

I added a BPF-helper that can tell us if the layout is compatible with
xdp_hints_common, which is basically the only UAPI the patchset
introduces. The code handling xdp_hints_common should be common.

I'm not super happy with the BPF-helper approach, so suggestions are
welcome. E.g. xdp_md/ctx->is_hint_common could be one approach, and
ctx->has_hint another (ctx is often called xdp, so it reads
xdp->has_hint).

One feature I need from the BPF-helper is to "disable" the xdp_hints
and allow the BPF-prog to use the entire metadata area for something
else (avoiding it being misinterpreted by the next prog or after a
redirect).

> What I'd like to avoid is an xdp program targeting specific drivers.
> Where possible, we should aim towards something like "if this device
> has rx_timestamp offload -> use it", without depending too much on
> specific btf_ids.

I do understand your wish, and adding rx_timestamps to
xdp_hints_common would be too easy (and IMHO a waste of u64/8 bytes
for all packets not needing this timestamp). Hopefully we can come up
with a good solution together.

One idea would be to extend libbpf to lookup or translate the struct
name

 struct xdp_hints_DRIVER_timestamp {
   __u64 rx_timestamp;
 } __attribute__((preserve_access_index));

into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when
loading the XDP prog, and have bpf_core_type_id_kernel() on the struct
return the ID of xdp_hints_i40e_timestamp.

But this idea doesn't really work for the veth redirect use-case :-(
as veth needs to handle xdp_hints from other drivers.

>>> c. The subset of fields in <device>_xdp_rx_hints should match the
>>>    ones from xdp_rx_hints (we essentially standardize on the field
>>>    names/sizes)
>>> d. We expose the <device>_xdp_rx_hints btf id via netlink for each
>>>    device
>>
>> For this proposed design you would still need more than one BTF ID or
>> <device>_xdp_rx_hints struct, because not all packets contain all
>> hints. The most common case is HW timestamping, which some HW only
>> supports for PTP frames.
>>
>> Plus, I don't see a need to expose anything via netlink, as we can
>> just use the existing BTF information from the module. Thus, we avoid
>> creating more UAPI.
> > See above. I think even with your series, that btf_id info should also
> come via netlink so the programs can query it before loading and do
> the required adjustments. Otherwise, I'm not sure I understand what I
> need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
> late? I need to know them in advance to at least populate those ids
> into the bpf program itself?

Yes, we need to know these IDs in advance, and we can. I don't think we need
the netlink interface, as we can already read out the BTF layout and IDs
today. I coded it up in userspace, where the intended consumer is AF_XDP
(as libbpf already does this itself).

See this code:
- https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
- https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c

>>> e. libbpf will query and do offset relocations for
>>> xdp_rx_hints -> <device>_xdp_rx_hints at load time
>>>
>>> Would that work? Then it seems like we can replace bitfields with the
>>
>> I used to be a fan of bitfields, until I discovered that they are bad
>> for performance, because compilers cannot optimize these.
>
> Ack, good point, something to keep in mind.
>
>>> following:
>>>
>>> if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
>>> /* use that hint */
>>
>> Fairly often a VLAN will not be set in packets, so we still have to read
>> and check a bitfield/flag if the VLAN value is valid. (Guess it is
>> implicit in above code).
>
> That's a fair point. Then we need two signals?
>
> 1. Whether this particular offload is supported for the device at all
> (via that bpf_core_field_exists or something similar)
> 2. Whether this particular packet has particular metadata (via your
> proposed flags)
>
> if (device I'm attaching xdp to has vlan offload) { // via
> bpf_core_field_exists?
> if (particular packet comes with a vlan tag) { // via your proposed
> bitfield flags?
> }
> }
>
> Or are we assuming that (2) is fast enough and we don't care about
> (1)? Because (1) can 'if (0)' the whole branch and make the verifier
> remove that part.
>
>>> }
>>>
>>> All we need here is for libbpf to, again, do xdp_rx_hints ->
>>> <device>_xdp_rx_hints translation before it evaluates
>>> bpf_core_field_exists()?
>>>
>>> Thoughts? Any downsides? Am I missing something?
>>>
>>
>> Well, the downside is primarily that this design limits innovation.
>>
>> Each time a NIC driver want to introduce a new hardware hint, they have
>> to update the central UAPI xdp_rx_hints struct first.
>>
>> The design in the patchset is to open for innovation. Driver can extend
>> their own xdp_hints_<driver>_xxx struct(s). They still have to land
>> their patches upstream, but avoid mangling a central UAPI struct. As
>> upstream we review driver changes and should focus on sane struct member
>> naming(+size) especially if this "sounds" like a hint/feature that more
>> driver are likely to support. With help from BTF relocations, a new
>> driver can support same hint/feature if naming(+size) match (without
>> necessary the same offset in the struct).
>
> The opposite side of this approach is that we'll have 'ixgbe_hints'
> with 'rx_timestamp' and 'mvneta_hints' with something like
> 'rx_tstamp'.

Well, as I wrote, reviewers should ask drivers to use the same member name.

>>> Also, about the TX side: I feel like the same can be applied there,
>>> the program works with xdp_tx_hints and libbpf will rewrite to
>>> <device>_xdp_tx_hints.
xdp_tx_hints might have fields like "has_tx_vlan:1";
>>> those, presumably, can be relocatable by libbpf as well?
>>>
>>
>> Good to think ahead for TX-side, even-though I think we should focus on
>> landing RX-side first.
>>
>> I notice your naming xdp_rx_hints vs. xdp_tx_hints. I have named the
>> common struct xdp_hints_common, without a RX/TX direction indication.
>> Maybe this is wrong of me, but my thinking was that most of the common
>> hints can be directly used as TX-side hints. I'm hoping TX-side
>> xdp-hints will need to do little-to-non adjustment, before using the
>> hints as TX "instruction". I'm hoping that XDP-redirect will just work
>> and xmit driver can use XDP-hints area.
>>
>> Please correct me if I'm wrong.
>> The checksum fields hopefully translates to similar TX offload "actions".
>> The VLAN offload hint should translate directly to TX-side.
>>
>> I can easily be convinced we should name it xdp_hints_rx_common from the
>> start, but then I will propose that xdp_hints_tx_common have the
>> checksum and VLAN fields+flags at same locations, such that we don't
>> take any performance hint for moving them to "TX-side" hints, making
>> XDP-redirect just work.
>
> Might be good to think about this beforehand. I agree that most of the
> layout should hopefully match. However once case that I'm interested
> in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
> in the metadata; for tx, I'm merely setting a flag somewhere to
> request it for async delivery later (I hope we plan to support that
> for af_xdp?). So the layout might be completely different :-(
>

Yes, it is definitely in my plans to support handling at TX-completion
time, so you can extract the TX-wire-timestamp. This is easy for AF_XDP
as it has the CQ (Completion Queue) step.

I'm getting ahead of myself, but for XDP I imagine that the driver will
populate this xdp_tx_hint in the DMA TX-completion function, and we can add
a kfunc "not-a-real-hook" to xdp_return_frame that can run another XDP
BPF-prog that can inspect the xdp_tx_hint in metadata.

At this proposed kfunc xdp_return_frame call point, we likely cannot know
which driver produced the xdp_hints metadata either, and thus cannot lock
our design or BTF-relocations to assume which driver it is loaded on.

[... cut ... getting too long]

--Jesper
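To make the "end inline" placement concrete: below is a minimal userspace
sketch of how an AF_XDP consumer could pick up the 64-bit BTF ID, assuming
(as proposed above) it is stored as the last member of the metadata area,
immediately before packet data start. The helper names and the
object-ID-in-upper-32-bits packing are illustrative assumptions, not
settled API.

#include <stdint.h>
#include <string.h>

/* Sketch: read the combined object+type BTF ID that the driver is
 * assumed to have stored as the last metadata member. */
static inline uint64_t xdp_meta_btf_full_id(const void *pkt_data)
{
	uint64_t id;

	/* last member of metadata == known negative offset from data */
	memcpy(&id, (const uint8_t *)pkt_data - sizeof(id), sizeof(id));
	return id;
}

/* Assumed packing: object (module) ID in the upper 32 bits. */
static inline uint32_t btf_obj_id(uint64_t full_id)  { return full_id >> 32; }
static inline uint32_t btf_type_id(uint64_t full_id) { return (uint32_t)full_id; }

An AF_XDP app would call xdp_meta_btf_full_id() on each received
descriptor's data pointer and dispatch on the result.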
On 10/05, Jesper Dangaard Brouer wrote: > On 04/10/2022 20.26, Stanislav Fomichev wrote: > > On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer > > <jbrouer@redhat.com> wrote: > > > > > > > > > On 04/10/2022 01.55, sdf@google.com wrote: > > > > On 09/07, Jesper Dangaard Brouer wrote: > > > > > This patchset expose the traditional hardware offload hints to > XDP and > > > > > rely on BTF to expose the layout to users. > > > > > > > > > Main idea is that the kernel and NIC drivers simply defines the > struct > > > > > layouts they choose to use for XDP-hints. These XDP-hints structs > gets > > > > > naturally and automatically described via BTF and implicitly > exported to > > > > > users. NIC drivers populate and records their own BTF ID as the > last > > > > > member in XDP metadata area (making it easily accessible by AF_XDP > > > > > userspace at a known negative offset from packet data start). > > > > > > > > > Naming conventions for the structs (xdp_hints_*) is used such that > > > > > userspace can find and decode the BTF layout and match against the > > > > > provided BTF IDs. Thus, no new UAPI interfaces are needed for > exporting > > > > > what XDP-hints a driver supports. > > > > > > > > > The patch "i40e: Add xdp_hints_union" introduce the idea of > creating a > > > > > union named "xdp_hints_union" in every driver, which contains all > > > > > xdp_hints_* struct this driver can support. This makes it > easier/quicker > > > > > to find and parse the relevant BTF types. (Seeking input before > fixing > > > > > up all drivers in patchset). > > > > > > > > > > > > > The main different from RFC-v1: > > > > > - Drop idea of BTF "origin" (vmlinux, module or local) > > > > > - Instead to use full 64-bit BTF ID that combine object+type ID > > > > > > > > > I've taken some of Alexandr/Larysa's libbpf patches and integrated > > > > > those. > > > > > > > > > Patchset exceeds netdev usually max 15 patches rule. My excuse is > three > > > > > NIC drivers (i40e, ixgbe and mvneta) gets XDP-hints support and > which > > > > > required some refactoring to remove the SKB dependencies. > > > > > > > > Hey Jesper, > > > > > > > > I took a quick look at the series. > > > Appreciate that! :-) > > > > > > > Do we really need the enum with the flags? > > > > > > The primary reason for using enum is that these gets exposed as BTF. > > > The proposal is that userspace/BTF need to obtain the flags via BTF, > > > such that they don't become UAPI, but something we can change later. > > > > > > > We might eventually hit that "first 16 bits are reserved" issue? > > > > > > > > Instead of exposing enum with the flags, why not solve it as > follows: > > > > a. We define UAPI struct xdp_rx_hints with _all_ possible hints > > > > > > How can we know _all_ possible hints from the beginning(?). > > > > > > UAPI + central struct dictating all possible hints, will limit > innovation. > > > > We don't need to know them all in advance. The same way we don't know > > them all for flags enum. That UAPI xdp_rx_hints can be extended any > > time some driver needs some new hint offload. The benefit here is that > > we have a "common registry" of all offloads and different drivers have > > an opportunity to share. > > > > Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake > > uapi struct (__sk_buff) and the access to it gets translated into > > <device>_xdp_rx_hints offsets (sk_buff). > > > > > > b. 
Each device defines much denser <device>_xdp_rx_hints struct
> with the
> > > metadata that it supports
> > >
> > > Thus, the NIC device is limited to what is defined in UAPI struct
> > > xdp_rx_hints. Again this limits innovation.
> >
> > I guess what I'm missing from your series is the bpf/userspace side.
> > Do you have an example on the bpf side that will work for, say,
> > xdp_hints_ixgbe_timestamp?
> >
> > Suppose, you pass this custom hints btf_id via xdp_md as proposed,

> I just want to reiterate why we place btf_full_id at the "end inline".
> This makes it easily available for AF_XDP to consume. Plus, we already
> have to write info into this metadata cache-line anyway, thus it's
> almost free. Moving bpf_full_id into xdp_md, will require expanding
> both xdp_buff and xdp_frame (+ extra store for converting
> buff-to-frame). If AF_XDP need this btf_full_id the BPF-prog _could_
> move/copy it from xdp_md to metadata, but that will just waste cycles,
> why not just store it once in a known location.

> One option, for convenience, would be to map xdp_md->bpf_full_id to load
> the btf_full_id value from the metadata. But that would essentially be
> syntax-sugar and adds UAPI.

> > what's the action on the bpf side to consume this?
> >
> > If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
> > populated at runtime by libbpf? */) {

> See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)
> AFAIK libbpf will make this a constant at load/setup time, and give us
> dead-code elimination.

Even with bpf_core_type_id_kernel() you would still have the following:

if (ctx_hints_btf_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
} else if (the same for every driver that has custom hints) {
}

Toke has a good suggestion on hiding this behind a helper; either
pre-generated on the libbpf side or a kfunc. We should try to hide
this per-device logic if possible; otherwise we'll get to per-device
XDP programs that only work on some special deployments.
OTOH, we'll probably get there with the hints anyway?

> > // do something with rx_timestamp
> > // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
> > } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
> > // do something else
> > // plus explicitly handle xdp_hints_common here?
> > } else {
> > // handle xdp_hints_common
> > }

> I added a BPF-helper that can tell us if layout if compatible with
> xdp_hints_common, which is basically the only UAPI the patchset
> introduces.
> The handle xdp_hints_common code should be common.

> I'm not super happy with the BPF-helper approach, so suggestions are
> welcome. E.g. xdp_md/ctx->is_hint_common could be one approach and
> ctx->has_hint (ctx is often called xdp so it reads xdp->has_hint).

> One feature I need from the BPF-helper is to "disable" the xdp_hints and
> allow the BPF-prog to use the entire metadata area for something else
> (avoiding it to be misintrepreted by next prog or after redirect).

As mentioned in the previous emails, let's try to have a bpf-side
example/selftest for the next round?

I also feel like xdp_hints_common is a bit distracting. It makes the
common case easy and it hides the discussion/complexity about per-device
hints. Maybe we can drop this common case altogether? Why can't every
driver have a custom hints struct? If we agree that naming/size will be
the same across them (and review catches/guarantees that), why do we even
care about having a common xdp_hints_common struct?
> > What I'd like to avoid is an xdp program targeting specific drivers. > > Where possible, we should aim towards something like "if this device > > has rx_timestamp offload -> use it without depending too much on > > specific btf_ids. > > > I do understand your wish, and adding rx_timestamps to xdp_hints_common > would be too easy (and IMHO wasting u64/8-bytes for all packets not > needing this timestamp). Hopefully we can come up with a good solution > together. > One idea would be to extend libbpf to lookup or translate struct name > struct xdp_hints_DRIVER_timestamp { > __u64 rx_timestamp; > } __attribute__((preserve_access_index)); > into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when > loading > the XDP prog. And the bpf_core_type_id_kernel() result of the struct > returning id from xdp_hints_i40e_timestamp. > But this ideas doesn't really work for the veth redirect use-case :-( > As veth need to handle xdp_hints from other drivers. Agreed. If we want redirect to work, then the parsing should be either mostly pre-generated by libbpf to include all possible btf ids that matter; or done similarly by a kfunc. The idea that we can pre-generate per-device bpf program seems to be out of the window now? > > > > c. The subset of fields in <device>_xdp_rx_hints should match the > ones from > > > > xdp_rx_hints (we essentially standardize on the field > names/sizes) > > > > d. We expose <device>_xdp_rx_hints btf id via netlink for each > device > > > > > > For this proposed design you would still need more than one BTF ID or > > > <device>_xdp_rx_hints struct's, because not all packets contains all > > > hints. The most common case is HW timestamping, which some HW only > > > supports for PTP frames. > > > > > > Plus, I don't see a need to expose anything via netlink, as we can > just > > > use the existing BTF information from the module. Thus, avoiding to > > > creating more UAPI. > > > > See above. I think even with your series, that btf_id info should also > > come via netlink so the programs can query it before loading and do > > the required adjustments. Otherwise, I'm not sure I understand what I > > need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too > > late? I need to know them in advance to at least populate those ids > > into the bpf program itself? > Yes, we need to know these IDs in advance and can. I don't think we need > the netlink interface, as we can already read out the BTF layout and IDs > today. I coded it up in userspace, where the intented consumer is AF_XDP > (as libbpf already does this itself). > See this code: > - > https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c > - > https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c SG, if we can have some convention on the names where we can reliably parse out all possible structs with the hints, let's rely solely on vmlinux+vmlinux module btf. > > > > e. libbpf will query and do offset relocations for > > > > xdp_rx_hints -> <device>_xdp_rx_hints at load time > > > > > > > > Would that work? Then it seems like we can replace bitfields with > the > > > > > > I used to be a fan of bitfields, until I discovered that they are bad > > > for performance, because compilers cannot optimize these. > > > > Ack, good point, something to keep in mind. 
> > > > > > following: > > > > > > > > if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) { > > > > /* use that hint */ > > > > > > Fairly often a VLAN will not be set in packets, so we still have to > read > > > and check a bitfield/flag if the VLAN value is valid. (Guess it is > > > implicit in above code). > > > > That's a fair point. Then we need two signals? > > > > 1. Whether this particular offload is supported for the device at all > > (via that bpf_core_field_exists or something similar) > > 2. Whether this particular packet has particular metadata (via your > > proposed flags) > > > > if (device I'm attaching xdp to has vlan offload) { // via > > bpf_core_field_exists? > > if (particular packet comes with a vlan tag) { // via your proposed > > bitfield flags? > > } > > } > > > > Or are we assuming that (2) is fast enough and we don't care about > > (1)? Because (1) can 'if (0)' the whole branch and make the verifier > > remove that part. > > > > > > } > > > > > > > > All we need here is for libbpf to, again, do xdp_rx_hints -> > > > > <device>_xdp_rx_hints translation before it evaluates > > > > bpf_core_field_exists()? > > > > > > > > Thoughts? Any downsides? Am I missing something? > > > > > > > > > > Well, the downside is primarily that this design limits innovation. > > > > > > Each time a NIC driver want to introduce a new hardware hint, they > have > > > to update the central UAPI xdp_rx_hints struct first. > > > > > > The design in the patchset is to open for innovation. Driver can > extend > > > their own xdp_hints_<driver>_xxx struct(s). They still have to land > > > their patches upstream, but avoid mangling a central UAPI struct. As > > > upstream we review driver changes and should focus on sane struct > member > > > naming(+size) especially if this "sounds" like a hint/feature that > more > > > driver are likely to support. With help from BTF relocations, a new > > > driver can support same hint/feature if naming(+size) match (without > > > necessary the same offset in the struct). > > > > The opposite side of this approach is that we'll have 'ixgbe_hints' > > with 'rx_timestamp' and 'mvneta_hints' with something like > > 'rx_tstamp'. > Well, as I wrote reviewers should ask drivers to use the same member name. SG! > > > > Also, about the TX side: I feel like the same can be applied there, > > > > the program works with xdp_tx_hints and libbpf will rewrite to > > > > <device>_xdp_tx_hints. xdp_tx_hints might have fields > like "has_tx_vlan:1"; > > > > those, presumably, can be relocatable by libbpf as well? > > > > > > > > > > Good to think ahead for TX-side, even-though I think we should focus > on > > > landing RX-side first. > > > > > > I notice your naming xdp_rx_hints vs. xdp_tx_hints. I have named the > > > common struct xdp_hints_common, without a RX/TX direction indication. > > > Maybe this is wrong of me, but my thinking was that most of the common > > > hints can be directly used as TX-side hints. I'm hoping TX-side > > > xdp-hints will need to do little-to-non adjustment, before using the > > > hints as TX "instruction". I'm hoping that XDP-redirect will just > work > > > and xmit driver can use XDP-hints area. > > > > > > Please correct me if I'm wrong. > > > The checksum fields hopefully translates to similar TX > offload "actions". > > > The VLAN offload hint should translate directly to TX-side. 
> > > > > > I can easily be convinced we should name it xdp_hints_rx_common from > the > > > start, but then I will propose that xdp_hints_tx_common have the > > > checksum and VLAN fields+flags at same locations, such that we don't > > > take any performance hint for moving them to "TX-side" hints, making > > > XDP-redirect just work. > > > > Might be good to think about this beforehand. I agree that most of the > > layout should hopefully match. However once case that I'm interested > > in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp > > in the metadata; for tx, I'm merely setting a flag somewhere to > > request it for async delivery later (I hope we plan to support that > > for af_xdp?). So the layout might be completely different :-( > > > Yes, it is definitely in my plans to support handling at TX-completion > time, so you can extract the TX-wire-timestamp. This is easy for AF_XDP > as it has the CQ (Completion Queue) step. > I'm getting ahead of myself, but for XDP I imagine that driver will > populate this xdp_tx_hint in DMA TX-completion function, and we can add > a kfunc "not-a-real-hook" to xdp_return_frame that can run another XDP > BPF-prog that can inspect the xdp_tx_hint in metadata. Can we also place that xdp_tx_hint somewhere in the completion ring for AF_XDP to consume? > At this proposed kfunc xdp_return_frame call point, we likely cannot know > what driver that produced the xdp_hints metadata either, and thus not lock > our design or BTF-reloacations to assume which driver is it loaded on. > [... cut ... getting too long] > --Jesper
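For reference, the BPF-prog side that this exchange keeps circling around
could look roughly like the sketch below. It assumes the metadata area
holds exactly one hints struct ending at packet data, with the 64-bit BTF
ID as its last member, and that libbpf patches bpf_core_type_id_kernel()
to the full object+type ID as the series proposes; the struct and member
names are assumptions.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Local shadow of an assumed driver struct; member offsets get
 * CO-RE-relocated against the module BTF at load time. */
struct xdp_hints_ixgbe_timestamp {
	__u64 rx_timestamp;
	__u64 btf_full_id;	/* assumed: ID stored as last member */
} __attribute__((preserve_access_index));

SEC("xdp")
int xdp_hints_dispatch(struct xdp_md *ctx)
{
	void *data      = (void *)(long)ctx->data;
	void *data_meta = (void *)(long)ctx->data_meta;
	struct xdp_hints_ixgbe_timestamp *h = data_meta;

	/* Sketch only: real code must also bounds-check the relocated
	 * member offsets before dereferencing. */
	if ((void *)(h + 1) > data)
		return XDP_PASS;

	if (h->btf_full_id ==
	    bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)) {
		/* layout confirmed; rx_timestamp is valid to use */
		bpf_printk("rx_timestamp %llu", h->rx_timestamp);
	} else {
		/* other driver layouts / xdp_hints_common fallback */
	}
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";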
On 10/05, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> >> > A intentionally wild question, what does it take for the driver to
> return the
> >> > hints. Is the rx_desc and rx_queue enough? When the xdp prog is
> calling a
> >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can the
> driver
> >> > replace it with some inline bpf code (like how the inline code is
> generated for
> >> > the map_lookup helper). The xdp prog can then store the hwstamp in
> the meta
> >> > area in any layout it wants.
> >>
> >> Since you mentioned it... FWIW that was always my preference rather
> than
> >> the BTF magic :) The jited image would have to be per-driver like we
> >> do for BPF offload but that's easy to do from the technical
> >> perspective (I doubt many deployments bind the same prog to multiple
> >> HW devices)..
> >
> > +1, sounds like a good alternative (got your reply while typing)
> > I'm not too versed in the rx_desc/rx_queue area, but seems like worst
> > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> > parse it out from the pre-populated metadata?
> >
> > Btw, do we also need to think about the redirect case? What happens
> > when I redirect one frame from a device A with one metadata format to
> > a device B with another?

> Yes, we absolutely do! In fact, to me this (redirects) is the main
> reason why we need the ID in the packet in the first place: when running
> on (say) a veth, an XDP program needs to be able to deal with packets
> from multiple physical NICs.

> As far as API is concerned, my hope was that we could solve this with a
> CO-RE like approach where the program author just writes something like:

> hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);

> and bpf_get_xdp_hint() is really a macro (or a special kind of
> relocation?) and libbpf would do the following on load:

> - query the kernel BTF for all possible xdp_hint structs
> - figure out which of them have an 'u64 hw_tstamp' member
> - generate the necessary conditionals / jump table to disambiguate on
> the BTF_ID in the packet

> Now, if this is better done by a kfunc I'm not terribly opposed to that
> either, but I'm not sure it's actually better/easier to do in the kernel
> than in libbpf at load time?

Replied in the other thread, but to reiterate here: then btf_id in the
metadata has to stay, and we either pre-generate those bpf_get_xdp_hint()
lookups at the libbpf level or at kfunc load time, as you mention.

But the program essentially has to handle all possible hints' btf ids
thrown at it by the system. Not sure about the performance in this case :-/
Maybe that's something that can be hidden behind an "I might receive
forwarded packets and I know how to handle all metadata formats" flag? By
default, we'll pre-generate parsing only for that specific device?
On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>> +1, sounds like a good alternative (got your reply while typing)
>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>> parse it out from the pre-populated metadata?
>>
>> I'd think so, worst case the driver can put xdp_md into a struct
>> and container_of() to get to its own stack with whatever fields
>> it needs.
>
> Ack, seems like something worth exploring then.
>
> The only issue I see with that is that we'd probably have to extend
> the loading api to pass target xdp device so we can pre-generate
> per-device bytecode for those kfuncs?

There is an existing attr->prog_ifindex for dev offload purpose. Maybe we can
re-purpose/re-use some of the offload API. How this kfunc can be presented also
needs some thoughts, could be a new ndo_xxx.... not sure.

> And this potentially will block attaching the same program
> to different drivers/devices?
> Or, Martin, did you maybe have something better in mind?

If the kfunc/helper is inline, then it will have to be per device. Unless the
bpf prog chooses not to inline, which could be an option, but I am also not sure
how often the user wants to 'attach' a loaded xdp prog to a different device.
To some extent, the CO-RE hints-loading-code will have to be per device also, no?

Why I asked about the kfunc/helper approach is because, from the set, it seems
the hints are already available at the driver. The specific knowledge the
xdp prog is missing is how to get the hints from the rx_desc/rx_queue. The
straightforward way to me is to make them (rx_desc/rx_queue) available to the xdp
prog and have a kfunc/helper to extract the hints from them only if the xdp prog
needs it. The xdp prog can selectively get what hints it needs and then
optionally store them into the meta area in any layout.

NETIF_F_XDP_HINTS_BIT probably won't be needed, and that is one less thing to
worry about in production.

>
>>> Btw, do we also need to think about the redirect case? What happens
>>> when I redirect one frame from a device A with one metadata format to
>>> a device B with another?
>>
>> If there is a program on Tx then it'd be trivial - just do the
>> info <-> descriptor translation in the opposite direction than Rx.

+1

>> TBH I'm not sure how it'd be done in the current approach, either.

Yeah, I think we need more selftests to show how things work.

>
> Yeah, I don't think it magically works in any case. I'm just trying to
> understand whether it's something we care to support out of the box or
> can punt to the bpf programs themselves and say "if you care about
> forwarding metadata, somehow agree on the format yourself".
>
>> Now I questioned the BTF way and mentioned the Tx-side program in
>> a single thread, I better stop talking...
>
> Forget about btf, hail to the new king - kfunc :-D
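As a thought experiment, the kfunc direction floated above could look like
this from the program's point of view; bpf_xdp_get_hwtstamp() is purely
hypothetical (no such kfunc exists), and the metadata layout is chosen by
the program itself:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfunc: would be inlined per-driver as discussed. */
extern __u64 bpf_xdp_get_hwtstamp(struct xdp_md *ctx) __ksym;

SEC("xdp")
int xdp_pull_hwtstamp(struct xdp_md *ctx)
{
	void *data, *data_meta;
	__u64 ts;

	/* Reserve 8 bytes of metadata, in a layout this prog chooses. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(ts)))
		return XDP_PASS;

	data      = (void *)(long)ctx->data;
	data_meta = (void *)(long)ctx->data_meta;
	if (data_meta + sizeof(ts) > data)
		return XDP_PASS;

	ts = bpf_xdp_get_hwtstamp(ctx);	/* hypothetical, driver-inlined */
	*(__u64 *)data_meta = ts;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";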
On 05/10/2022 19:47, sdf@google.com wrote:
> On 10/05, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>
>> > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> >>
>> >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
>> >> > A intentionally wild question, what does it take for the driver
>> to return the
>> >> > hints. Is the rx_desc and rx_queue enough? When the xdp prog is
>> calling a
>> >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can
>> the driver
>> >> > replace it with some inline bpf code (like how the inline code is
>> generated for
>> >> > the map_lookup helper). The xdp prog can then store the hwstamp
>> in the meta
>> >> > area in any layout it wants.
>> >>
>> >> Since you mentioned it... FWIW that was always my preference rather
>> than
>> >> the BTF magic :) The jited image would have to be per-driver like we
>> >> do for BPF offload but that's easy to do from the technical
>> >> perspective (I doubt many deployments bind the same prog to multiple
>> >> HW devices)..
>> >
>> > +1, sounds like a good alternative (got your reply while typing)
>> > I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>> > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>> > parse it out from the pre-populated metadata?
>> >
>> > Btw, do we also need to think about the redirect case? What happens
>> > when I redirect one frame from a device A with one metadata format to
>> > a device B with another?
>
>> Yes, we absolutely do! In fact, to me this (redirects) is the main
>> reason why we need the ID in the packet in the first place: when running
>> on (say) a veth, an XDP program needs to be able to deal with packets
>> from multiple physical NICs.
>
>> As far as API is concerned, my hope was that we could solve this with a
>> CO-RE like approach where the program author just writes something like:
>
>> hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);
>
>> and bpf_get_xdp_hint() is really a macro (or a special kind of
>> relocation?) and libbpf would do the following on load:
>
>> - query the kernel BTF for all possible xdp_hint structs
>> - figure out which of them have an 'u64 hw_tstamp' member
>> - generate the necessary conditionals / jump table to disambiguate on
>> the BTF_ID in the packet
>
>
>> Now, if this is better done by a kfunc I'm not terribly opposed to that
>> either, but I'm not sure it's actually better/easier to do in the kernel
>> than in libbpf at load time?
>
> Replied in the other thread, but to reiterate here: then btf_id in the
> metadata has to stay and we either pre-generate those bpf_get_xdp_hint()
> at libbpf or at kfunc load time level as you mention.
>
> But the program essentially has to handle all possible hints' btf ids
> thrown
> at it by the system. Not sure about the performance in this case :-/
> Maybe that's something that can be hidden behind "I might receive forwarded
> packets and I know how to handle all metadata format" flag? By default,
> we'll pre-generate parsing only for that specific device?

I did a simple POC of Jesper's xdp-hints with AF-XDP and CNDP (Cloud Native
Data Plane). In the cases where my app had access to the HW I didn't need
to handle all possible hints...
So at program init time I registered the relevant BTF_IDs (and some callback
functions to handle them) from the NICs that were available to me in a
simple tailq (tbh there were so few I could've probably used a static
array).

When processing the hints, I then only needed to invoke the appropriate
callback function based on the received BTF_ID. I didn't have massive
chains of if...else if... else statements.

In the case where we have redirection to a virtual NIC and we don't
necessarily know the underlying hints that are exposed to the app, could
we not still use the xdp_hints (as proposed by Jesper) themselves to indicate
the relevant drivers to the application? Or even indicate them via a map or
something?
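The registration scheme described here can be as small as the following
userspace sketch; a static array stands in for the tailq, and all names
are illustrative:

#include <stdint.h>
#include <stddef.h>

typedef void (*hint_cb_t)(const void *meta, const void *pkt, size_t len);

struct hint_handler {
	uint64_t  btf_full_id;	/* resolved from module BTF at init */
	hint_cb_t cb;
};

#define MAX_HANDLERS 16
static struct hint_handler handlers[MAX_HANDLERS];
static int nr_handlers;

static int hint_register(uint64_t btf_full_id, hint_cb_t cb)
{
	if (nr_handlers >= MAX_HANDLERS)
		return -1;
	handlers[nr_handlers].btf_full_id = btf_full_id;
	handlers[nr_handlers].cb = cb;
	nr_handlers++;
	return 0;
}

static void hint_dispatch(uint64_t btf_full_id, const void *meta,
			  const void *pkt, size_t len)
{
	for (int i = 0; i < nr_handlers; i++) {
		if (handlers[i].btf_full_id == btf_full_id) {
			handlers[i].cb(meta, pkt, len);
			return;
		}
	}
	/* unknown layout: ignore the hints or fall back to common handling */
}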
On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/4/22 7:15 PM, Stanislav Fomichev wrote: > > On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote: > >> > >> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote: > >>> +1, sounds like a good alternative (got your reply while typing) > >>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst > >>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and > >>> parse it out from the pre-populated metadata? > >> > >> I'd think so, worst case the driver can put xdp_md into a struct > >> and container_of() to get to its own stack with whatever fields > >> it needs. > > > > Ack, seems like something worth exploring then. > > > > The only issue I see with that is that we'd probably have to extend > > the loading api to pass target xdp device so we can pre-generate > > per-device bytecode for those kfuncs? > > There is an existing attr->prog_ifindex for dev offload purpose. May be we can > re-purpose/re-use some of the offload API. How this kfunc can be presented also > needs some thoughts, could be a new ndo_xxx.... not sure. > > And this potentially will block attaching the same program > > to different drivers/devices? > > Or, Martin, did you maybe have something better in mind? > > If the kfunc/helper is inline, then it will have to be per device. Unless the > bpf prog chooses not to inline which could be an option but I am also not sure > how often the user wants to 'attach' a loaded xdp prog to a different device. > To some extend, the CO-RE hints-loading-code will have to be per device also, no? > > Why I asked the kfunc/helper approach is because, from the set, it seems the > hints has already been available at the driver. The specific knowledge that the > xdp prog missing is how to get the hints from the rx_desc/rx_queue. The > straight forward way to me is to make them (rx_desc/rx_queue) available to xdp > prog and have kfunc/helper to extract the hints from them only if the xdp prog > needs it. The xdp prog can selectively get what hints it needs and then > optionally store them into the meta area in any layout. This sounds like a really good idea to me, well worth exploring. To only have to pay, performance wise, for the metadata you actually use is very important. I did some experiments [1] on the previous patch set of Jesper's and there is substantial overhead added for each metadata enabled (and fetched from the NIC). This is especially important for AF_XDP in zero-copy mode where most packets are directed to user-space (if not, you should be using the regular driver that is optimized for passing packets to the stack or redirecting to other devices). In this case, the user knows exactly what metadata it wants and where in the metadata area it should be located in order to offer the best performance for the application in question. But as you say, your suggestion could potentially offer a good performance upside to the regular XDP path too. [1] https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/ > NETIF_F_XDP_HINTS_BIT probably won't be needed and one less thing to worry in > production. > > > > >>> Btw, do we also need to think about the redirect case? What happens > >>> when I redirect one frame from a device A with one metadata format to > >>> a device B with another? 
> >> > >> If there is a program on Tx then it'd be trivial - just do the > >> info <-> descriptor translation in the opposite direction than Rx. > > +1 > > >> TBH I'm not sure how it'd be done in the current approach, either. > > Yeah, I think we need more selftest to show how things work. > > > > > Yeah, I don't think it magically works in any case. I'm just trying to > > understand whether it's something we care to support out of the box or > > can punt to the bpf programs themselves and say "if you care about > > forwarding metadata, somehow agree on the format yourself". > > > >> Now I questioned the BTF way and mentioned the Tx-side program in > >> a single thread, I better stop talking... > > > > Forget about btf, hail to the new king - kfunc :-D >
On Wed, 5 Oct 2022 16:19:30 +0200 Jesper Dangaard Brouer wrote: > >> Since you mentioned it... FWIW that was always my preference rather than > >> the BTF magic :) The jited image would have to be per-driver like we > >> do for BPF offload but that's easy to do from the technical > >> perspective (I doubt many deployments bind the same prog to multiple > >> HW devices).. > > On the technical side we do have the ifindex that can be passed along > which is currently used for getting XDP hardware offloading to work. > But last time I tried this, I failed due to BPF tail call maps. FWIW the tail call map should be solvable by enforcing that the map is also pinned and so are all the programs in it. Perhaps I find that less ugly than others.. since that's what the offload path did :) > (It's not going to fly for other reasons, see redirect below).
On 06/10/2022 11.14, Magnus Karlsson wrote:
> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>> I'm not too versed in the rx_desc/rx_queue area, but seems like worst
>>>>> case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
>>>>> parse it out from the pre-populated metadata?
>>>>
>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>> and container_of() to get to its own stack with whatever fields
>>>> it needs.
>>>
>>> Ack, seems like something worth exploring then.
>>>
>>> The only issue I see with that is that we'd probably have to extend
>>> the loading api to pass target xdp device so we can pre-generate
>>> per-device bytecode for those kfuncs?
>>
>> There is an existing attr->prog_ifindex for dev offload purpose. May be we can
>> re-purpose/re-use some of the offload API. How this kfunc can be presented also
>> needs some thoughts, could be a new ndo_xxx.... not sure.
>>> And this potentially will block attaching the same program
>> > to different drivers/devices?
>>> Or, Martin, did you maybe have something better in mind?
>>
>> If the kfunc/helper is inline, then it will have to be per device. Unless the
>> bpf prog chooses not to inline which could be an option but I am also not sure
>> how often the user wants to 'attach' a loaded xdp prog to a different device.
>> To some extend, the CO-RE hints-loading-code will have to be per device also, no?
>>
>> Why I asked the kfunc/helper approach is because, from the set, it seems the
>> hints has already been available at the driver. The specific knowledge that the
>> xdp prog missing is how to get the hints from the rx_desc/rx_queue. The
>> straight forward way to me is to make them (rx_desc/rx_queue) available to xdp
>> prog and have kfunc/helper to extract the hints from them only if the xdp prog
>> needs it. The xdp prog can selectively get what hints it needs and then
>> optionally store them into the meta area in any layout.
>
> This sounds like a really good idea to me, well worth exploring. To
> only have to pay, performance wise, for the metadata you actually use
> is very important. I did some experiments [1] on the previous patch
> set of Jesper's and there is substantial overhead added for each
> metadata enabled (and fetched from the NIC). This is especially
> important for AF_XDP in zero-copy mode where most packets are directed
> to user-space (if not, you should be using the regular driver that is
> optimized for passing packets to the stack or redirecting to other
> devices). In this case, the user knows exactly what metadata it wants
> and where in the metadata area it should be located in order to offer
> the best performance for the application in question. But as you say,
> your suggestion could potentially offer a good performance upside to
> the regular XDP path too.

Okay, let's revisit this again, and let me explain why I believe this
isn't going to fly. This was also my initial thought: let's just give
the XDP BPF-prog direct access to the NIC rx_descriptor, or have
another BPF-prog populate the XDP-hints prior to calling the XDP-prog.
Going down this path (previously) I learned three things:

(1) Understanding/decoding the rx_descriptor requires access to the
programmer's datasheet, because it is very compacted and the meaning of
the bits depends on other bits, plus the current configuration status of
the HW.

(2) HW has bugs, and for certain chip revisions the driver will skip some
offload hints. Thus, chip revisions need to be exported to BPF-progs and
handled appropriately.

(3) Sometimes the info is actually not available in the rx_descriptor.
Often for HW timestamps, the timestamp needs to be read from a HW
register. How do we expose this to the BPF-prog?

> [1] https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/

Notice that this patchset doesn't block this idea, as it is orthogonal.
After we have established a way to express xdp_hints layouts via BTF,
then we can still add a pre-XDP BPF-prog that populates the XDP-hints,
and squeeze out more performance by skipping some of the offloads that
your specific XDP-prog is not interested in.

--Jesper
On 10/06, Maryam Tahhan wrote:
> On 05/10/2022 19:47, sdf@google.com wrote:
> > On 10/05, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev <sdf@google.com> writes:
> > > > On Tue, Oct 4, 2022 at 5:59 PM Jakub Kicinski <kuba@kernel.org>
> wrote:
> > > >>
> > > >> On Tue, 4 Oct 2022 17:25:51 -0700 Martin KaFai Lau wrote:
> > > >> > A intentionally wild question, what does it take for the driver
> > > to return the
> > > >> > hints. Is the rx_desc and rx_queue enough? When the xdp prog
> > > is calling a
> > > >> > kfunc/bpf-helper, like 'hwtstamp = bpf_xdp_get_hwtstamp()', can
> > > the driver
> > > >> > replace it with some inline bpf code (like how the inline code
> > > is generated for
> > > >> > the map_lookup helper). The xdp prog can then store the
> > > hwstamp in the meta
> > > >> > area in any layout it wants.
> > > >>
> > > >> Since you mentioned it... FWIW that was always my preference
> > > rather than
> > > >> the BTF magic :) The jited image would have to be per-driver like
> we
> > > >> do for BPF offload but that's easy to do from the technical
> > > >> perspective (I doubt many deployments bind the same prog to
> multiple
> > > >> HW devices)..
> > > >
> > > > +1, sounds like a good alternative (got your reply while typing)
> > > > I'm not too versed in the rx_desc/rx_queue area, but seems like
> worst
> > > > case that bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx and
> > > > parse it out from the pre-populated metadata?
> > > >
> > > > Btw, do we also need to think about the redirect case? What happens
> > > > when I redirect one frame from a device A with one metadata format
> to
> > > > a device B with another?
> >
> > > Yes, we absolutely do! In fact, to me this (redirects) is the main
> > > reason why we need the ID in the packet in the first place: when
> running
> > > on (say) a veth, an XDP program needs to be able to deal with packets
> > > from multiple physical NICs.
> >
> > > As far as API is concerned, my hope was that we could solve this with
> a
> > > CO-RE like approach where the program author just writes something
> like:
> >
> > > hw_tstamp = bpf_get_xdp_hint("hw_tstamp", u64);
> >
> > > and bpf_get_xdp_hint() is really a macro (or a special kind of
> > > relocation?) and libbpf would do the following on load:
> >
> > > - query the kernel BTF for all possible xdp_hint structs
> > > - figure out which of them have an 'u64 hw_tstamp' member
> > > - generate the necessary conditionals / jump table to disambiguate on
> > > the BTF_ID in the packet
> > >
> > > Now, if this is better done by a kfunc I'm not terribly opposed to
> that
> > > either, but I'm not sure it's actually better/easier to do in the
> kernel
> > > than in libbpf at load time?
> >
> > Replied in the other thread, but to reiterate here: then btf_id in the
> > metadata has to stay and we either pre-generate those bpf_get_xdp_hint()
> > at libbpf or at kfunc load time level as you mention.
> >
> > But the program essentially has to handle all possible hints' btf ids
> > thrown
> > at it by the system. Not sure about the performance in this case :-/
> > Maybe that's something that can be hidden behind "I might receive
> forwarded
> > packets and I know how to handle all metadata format" flag? By default,
> > we'll pre-generate parsing only for that specific device?

> I did a simple POC of Jespers xdp-hints with AF-XDP and CNDP (Cloud Native
> Data Plane). In the cases where my app had access to the HW I didn't need
> to
> handle all possible hints...
I knew what Drivers were on the system and > they > were the hints I needed to deal with. > So at program init time I registered the relevant BTF_IDs (and some > callback > functions to handle them) from the NICs that were available to me in a > simple tailq (tbh there were so few I could've probably used a static > array). > When processing the hints then I only needed to invoke the appropriate > callback function based on the received BTF_ID. I didn't have a massive > chains of if...else if... else statements. > In the case where we have redirection to a virtual NIC and we don't > necessarily know the underlying hints that are exposed to the app, could > we > not still use the xdp_hints (as proposed by Jesper) themselves to indicate > the relevant drivers to the application? or even indicate them via a map > or > something? Ideally this all should be handled by the common infra (libbpf/libxdp?). We probably don't want every xdp/af_xdp user to custom-implement all this btf_id->layout parsing? That's why the request for a selftest that shows how metadata can be accessed from bpf/af_xdp.
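For the vmlinux+module-BTF route, the lookup side is already doable with
stock libbpf, along the lines of the btf_module_* examples linked earlier.
A sketch; the "i40e"/"xdp_hints_i40e" names are assumptions:

#include <linux/btf.h>
#include <bpf/btf.h>

/* Resolve a driver's xdp_hints struct type ID from its module BTF. */
static int lookup_hints_type_id(const char *module, const char *struct_name)
{
	struct btf *vmlinux, *mod;
	int type_id = -1;

	vmlinux = btf__load_vmlinux_btf();
	if (!vmlinux)
		return -1;

	mod = btf__load_module_btf(module, vmlinux);
	if (mod) {
		type_id = btf__find_by_name_kind(mod, struct_name,
						 BTF_KIND_STRUCT);
		btf__free(mod);
	}
	btf__free(vmlinux);
	/* combine with the module's BTF object ID to form the full
	 * 64-bit ID that the metadata carries */
	return type_id;
}

/* usage: lookup_hints_type_id("i40e", "xdp_hints_i40e"); */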
On 05/10/2022 20.43, sdf@google.com wrote:
> On 10/05, Jesper Dangaard Brouer wrote:
>
>> On 04/10/2022 20.26, Stanislav Fomichev wrote:
>> > On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer
>> > <jbrouer@redhat.com> wrote:
>> > >
>> > >
>> > > On 04/10/2022 01.55, sdf@google.com wrote:
>> > > > On 09/07, Jesper Dangaard Brouer wrote:
>> > > > > This patchset expose the traditional hardware offload hints to XDP and
>> > > > > rely on BTF to expose the layout to users.
>> > > >
>> > > > > Main idea is that the kernel and NIC drivers simply defines the struct
>> > > > > layouts they choose to use for XDP-hints. These XDP-hints structs gets
>> > > > > naturally and automatically described via BTF and implicitly exported to
>> > > > > users. NIC drivers populate and records their own BTF ID as the last
>> > > > > member in XDP metadata area (making it easily accessible by AF_XDP
>> > > > > userspace at a known negative offset from packet data start).
>> > > >
>> > > > > Naming conventions for the structs (xdp_hints_*) is used such that
>> > > > > userspace can find and decode the BTF layout and match against the
>> > > > > provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
>> > > > > what XDP-hints a driver supports.
>> > > >
>> > > > > The patch "i40e: Add xdp_hints_union" introduce the idea of creating a
>> > > > > union named "xdp_hints_union" in every driver, which contains all
>> > > > > xdp_hints_* struct this driver can support. This makes it easier/quicker
>> > > > > to find and parse the relevant BTF types. (Seeking input before fixing
>> > > > > up all drivers in patchset).
>> > > > [...]
>> >
>> > > > b. Each device defines much denser <device>_xdp_rx_hints struct with the
>> > > > metadata that it supports
>> > >
>> > > Thus, the NIC device is limited to what is defined in UAPI struct
>> > > xdp_rx_hints. Again this limits innovation.
>> >
>> > I guess what I'm missing from your series is the bpf/userspace side.
>> > Do you have an example on the bpf side that will work for, say,
>> > xdp_hints_ixgbe_timestamp?

We have been consuming this from AF_XDP and decoding BTF in userspace
and checking BTF IDs in our userspace apps. I will try to code up
consuming this from XDP BPF-progs to get a better feel for that.

>> >
>> > Suppose, you pass this custom hints btf_id via xdp_md as proposed,
>
>> I just want to reiterate why we place btf_full_id at the "end inline".
>> This makes it easily available for AF_XDP to consume. Plus, we already
>> have to write info into this metadata cache-line anyway, thus it's
>> almost free. Moving bpf_full_id into xdp_md, will require expanding
>> both xdp_buff and xdp_frame (+ extra store for converting
>> buff-to-frame). If AF_XDP need this btf_full_id the BPF-prog _could_
>> move/copy it from xdp_md to metadata, but that will just waste cycles,
>> why not just store it once in a known location.
>
>> One option, for convenience, would be to map xdp_md->bpf_full_id to load
>> the btf_full_id value from the metadata. But that would essentially be
>> syntax-sugar and adds UAPI.
>
>> > what's the action on the bpf side to consume this?
>> >
>> > If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
>> > populated at runtime by libbpf? */) {
>
>> See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp)
>> AFAIK libbpf will make this a constant at load/setup time, and give us
>> dead-code elimination.
>
> Even with bpf_core_type_id_kernel() you still would have the following:
>
> if (ctx_hints_btf_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
> } else if (the same for every driver that has custom hints) {
> }
>
> Toke has a good suggestion on hiding this behind a helper; either
> pre-generated on the libbpf side or a kfunc. We should try to hide
> this per-device logic if possible; otherwise we'll get to per-device
> XDP programs that only work on some special deployments.
> OTOH, we'll probably get there with the hints anyway?

Well yes, XDP-hints is trying to let NIC drivers innovate and export HW
hints that are specific to a given driver. Thus, we should allow code to
get device-specific hints.

I do like this idea of hiding this behind something. Like libbpf could
detect this and apply CO-RE tricks, e.g. based on the struct name starting
with xdp_rx_hints___xxx and member rx_timestamp, it could scan the entire
system (all loaded modules) for xdp_rx_hints_* structs and find those that
contain member rx_timestamp, and then expand that to the if-else-if
statements matching against IDs and access rx_timestamp at the correct
offset.

Unfortunately this auto expansion will add code that isn't needed for an
XDP BPF-prog loaded on a specific physical device (as some IDs will not be
able to appear). For the veth case it is useful.

Going back to ifindex, if an XDP BPF-prog does have an ifindex, then we
could limit the expansion to BTF layouts from that driver. It just feels
like a lot of syntax-sugar and magic to hide the driver name e.g.
"xdp_hints_ixgbe_timestamp" in the C-code.

>> > // do something with rx_timestamp
>> > // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
>> > } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
>> > // do something else
>> > // plus explicitly handle xdp_hints_common here?
>> > } else {
>> > // handle xdp_hints_common
>> > }
>
>> I added a BPF-helper that can tell us if layout if compatible with
>> xdp_hints_common, which is basically the only UAPI the patchset
>> introduces.
>> The handle xdp_hints_common code should be common.
>>
>> I'm not super happy with the BPF-helper approach, so suggestions are
>> welcome. E.g. xdp_md/ctx->is_hint_common could be one approach and
>> ctx->has_hint (ctx is often called xdp so it reads xdp->has_hint).
>>
>> One feature I need from the BPF-helper is to "disable" the xdp_hints and
>> allow the BPF-prog to use the entire metadata area for something else
>> (avoiding it to be misintrepreted by next prog or after redirect).
>
> As mentioned in the previous emails, let's try to have a bpf side
> example/selftest for the next round?

Yes, I do need to add BPF-prog examples and selftests. I am considering
sending the next round (still as RFC) without this, to show what Maryam
and Magnus settled on for AF_XDP desc option flags.

> I also feel like xdp_hints_common is
> a bit distracting. It makes the common case easy and it hides the
> discussion/complexity about per-device hints. Maybe we can drop this
> common case at all? Why can't every driver has a custom hints struct?
> If we agree that naming/size will be the same across them (and review
> catches/guaranteed that), why do we even care about having common
> xdp_hints_common struct?

The xdp_hints_common struct is a stepping stone to making this easily
consumable from C-code that needs to generate SKBs and info for
virtio_net 'hdr' desc.

David Ahern has been begging me for years to just add this statically
to xdp_frame.
I have been reluctant, because I think we can come up with a more
flexible (less UAPI-fixed) way that allows both kernel-code and BPF-progs
to access these fields. I think of this approach as a compromise between
these two users. Meaning struct xdp_hints_common can be changed anytime
in the kernel C-code, and BPF-progs must access the area via BTF/CO-RE.

>> > What I'd like to avoid is an xdp program targeting specific drivers.
>> > Where possible, we should aim towards something like "if this device
>> > has rx_timestamp offload -> use it without depending too much on
>> > specific btf_ids.
>> >
>
>> I do understand your wish, and adding rx_timestamps to xdp_hints_common
>> would be too easy (and IMHO wasting u64/8-bytes for all packets not
>> needing this timestamp). Hopefully we can come up with a good solution
>> together.
>
>> One idea would be to extend libbpf to lookup or translate struct name
>
>> struct xdp_hints_DRIVER_timestamp {
>> __u64 rx_timestamp;
>> } __attribute__((preserve_access_index));
>
>> into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when
>> loading
>> the XDP prog. And the bpf_core_type_id_kernel() result of the struct
>> returning id from xdp_hints_i40e_timestamp.
>
>> But this ideas doesn't really work for the veth redirect use-case :-(
>> As veth need to handle xdp_hints from other drivers.
>
> Agreed. If we want redirect to work, then the parsing should be either
> mostly pre-generated by libbpf to include all possible btf ids that
> matter; or done similarly by a kfunc. The idea that we can pre-generate
> per-device bpf program seems to be out of the window now?

Hmm, the per-device thing could be an optimization that is performed if
an ifindex has been provided.

BUT for redirect to work, we do need to have the full BTF ID, to identify
structs coming from other device drivers and their BTF layout. We have
mentioned redirect into veth several times, but the same goes for redirect
into AF_XDP, which needs to identify the BTF layout.

[...]

>> > See above. I think even with your series, that btf_id info should also
>> > come via netlink so the programs can query it before loading and do
>> > the required adjustments. Otherwise, I'm not sure I understand what I
>> > need to do with a btf_id that comes via xdp_md/xdp_frame. It seems too
>> > late? I need to know them in advance to at least populate those ids
>> > into the bpf program itself?
>
>> Yes, we need to know these IDs in advance and can. I don't think we need
>> the netlink interface, as we can already read out the BTF layout and IDs
>> today. I coded it up in userspace, where the intented consumer is AF_XDP
>> (as libbpf already does this itself).
>
>> See this code:
>> -
>> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
>> -
>> https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c
>
> SG, if we can have some convention on the names where we can reliably
> parse out all possible structs with the hints, let's rely solely on
> vmlinux+vmlinux module btf.

Yes, I am proposing a convention on the struct BTF names to find the
'xdp_hints_*' structs that the driver can produce.

To make it quicker to find the xdp_hints structs in a driver, I am also
proposing a 'union' that contains all the xdp_hints structs.
- See "[PATCH 14/18] i40e: Add xdp_hints_union".

The BTF effect of this is that each driver will have an xdp_hints_union
with the same "name", which points to all the other BTF IDs.
I am wondering if we can leverage this for CO-RE relocations too. Then you
can define your BPF-prog shadow union with the member rx_timestamp (and
__attribute__((preserve_access_index))) and let CO-RE/libbpf do the offset
adjustments. (But again we are back to which driver the BPF-prog is
attached to, and veth having to handle all possible drivers.)

[...]

>> > > >
>> > > > All we need here is for libbpf to, again, do xdp_rx_hints ->
>> > > > <device>_xdp_rx_hints translation before it evaluates
>> > > > bpf_core_field_exists()?
>> > > >
>> > > > Thoughts? Any downsides? Am I missing something?
>> > > >
>> > >
>> > > Well, the downside is primarily that this design limits innovation.
>> > >
>> > > Each time a NIC driver want to introduce a new hardware hint, they have
>> > > to update the central UAPI xdp_rx_hints struct first.
>> > >
>> > > The design in the patchset is to open for innovation. Driver can extend
>> > > their own xdp_hints_<driver>_xxx struct(s). They still have to land
>> > > their patches upstream, but avoid mangling a central UAPI struct. As
>> > > upstream we review driver changes and should focus on sane struct member
>> > > naming(+size) especially if this "sounds" like a hint/feature that more
>> > > driver are likely to support. With help from BTF relocations, a new
>> > > driver can support same hint/feature if naming(+size) match (without
>> > > necessary the same offset in the struct).
>> >
>> > The opposite side of this approach is that we'll have 'ixgbe_hints'
>> > with 'rx_timestamp' and 'mvneta_hints' with something like
>> > 'rx_tstamp'.
>
>> Well, as I wrote reviewers should ask drivers to use the same member
>> name.
>
> SG!
>
>> > > > Also, about the TX side: I feel like the same can be applied there,
>> > > > the program works with xdp_tx_hints and libbpf will rewrite to
>> > > > <device>_xdp_tx_hints. xdp_tx_hints might have fields like "has_tx_vlan:1";
>> > > > those, presumably, can be relocatable by libbpf as well?
>> > > >
>> > >
>> > > Good to think ahead for TX-side, even-though I think we should focus on
>> > > landing RX-side first.
>> > >
>> > > I notice your naming xdp_rx_hints vs. xdp_tx_hints. I have named the
>> > > common struct xdp_hints_common, without a RX/TX direction indication.
>> > > Maybe this is wrong of me, but my thinking was that most of the common
>> > > hints can be directly used as TX-side hints. I'm hoping TX-side
>> > > xdp-hints will need to do little-to-non adjustment, before using the
>> > > hints as TX "instruction". I'm hoping that XDP-redirect will just work
>> > > and xmit driver can use XDP-hints area.
>> > >
>> > > Please correct me if I'm wrong.
>> > > The checksum fields hopefully translates to similar TX offload "actions".
>> > > The VLAN offload hint should translate directly to TX-side.
>> > >
>> > > I can easily be convinced we should name it xdp_hints_rx_common from the
>> > > start, but then I will propose that xdp_hints_tx_common have the
>> > > checksum and VLAN fields+flags at same locations, such that we don't
>> > > take any performance hint for moving them to "TX-side" hints, making
>> > > XDP-redirect just work.
>> >
>> > Might be good to think about this beforehand. I agree that most of the
>> > layout should hopefully match. However once case that I'm interested
>> > in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp
>> > in the metadata; for tx, I'm merely setting a flag somewhere to
>> > request it for async delivery later (I hope we plan to support that
>> > for af_xdp?).
>> > So the layout might be completely different :-(
>> >
>
>> Yes, it is definitely in my plans to support handling at TX-completion
>> time, so you can extract the TX-wire-timestamp. This is easy for AF_XDP
>> as it has the CQ (Completion Queue) step.
>
>> I'm getting ahead of myself, but for XDP I imagine that the driver will
>> populate this xdp_tx_hint in the DMA TX-completion function, and we can
>> add a kfunc "not-a-real-hook" to xdp_return_frame that can run another
>> XDP BPF-prog to inspect the xdp_tx_hint in the metadata.
>
> Can we also place that xdp_tx_hint somewhere in the completion ring
> for AF_XDP to consume?

Yes, that is basically what I said above. This will be automatic/easy
for AF_XDP as it has the CQ (Completion Queue) ring. The packets in the
completion ring will still contain the metadata area, which could have
been populated with the TX-wire-timestamp.

>> At this proposed kfunc xdp_return_frame call point, we likely cannot
>> know which driver produced the xdp_hints metadata either, and thus must
>> not lock our design or BTF relocations into assuming which driver it is
>> loaded on.
>
>> [... cut ... getting too long]

--Jesper
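As a rough illustration of that last point for AF_XDP, a userspace
consumer could pull the TX-wire-timestamp out of the metadata area of a
completed frame roughly like this (the struct layout, field names and
the btf_full_id check are assumptions for this sketch, not the
patchset's API; only xsk_umem__get_data is existing libbpf API):

  #include <bpf/xsk.h>

  struct xdp_hints_tx_timestamp {
          __u64 tx_wire_timestamp;
          __u64 btf_full_id;    /* last member, just before data start */
  };

  /* comp_addr is an address read from the CQ (Completion Queue) ring */
  static __u64 get_tx_wire_timestamp(void *umem_area, __u64 comp_addr)
  {
          __u8 *pkt = xsk_umem__get_data(umem_area, comp_addr);
          struct xdp_hints_tx_timestamp *hints =
                  (void *)(pkt - sizeof(*hints));

          /* A real consumer must first match hints->btf_full_id against
           * the expected module BTF ID before trusting this layout. */
          return hints->tx_wire_timestamp;
  }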
On 10/6/22 11:47 AM, Jesper Dangaard Brouer wrote:
>
>
>> I also feel like xdp_hints_common is
>> a bit distracting. It makes the common case easy and it hides the
>> discussion/complexity about per-device hints. Maybe we can drop this
>> common case at all? Why can't every driver have a custom hints struct?
>> If we agree that naming/size will be the same across them (and review
>> catches/guarantees that), why do we even care about having a common
>> xdp_hints_common struct?
>
> The xdp_hints_common struct is a stepping stone to making this easily
> consumable from C-code that needs to generate SKBs and info for the
> virtio_net 'hdr' desc.
>
> David Ahern has been begging me for years to just add this statically
> to xdp_frame. I have been reluctant, because I think we can come up
> with a more flexible (less UAPI-fixed) way that allows both kernel code
> and BPF-progs to access these fields. I think of this approach as a
> compromise between these two users.
>

A simple implementation for common - standard - networking features;
jumping through hoops to use vendor-unique features. Isn't that the
point of standardization? There are multiple use cases where VLAN and
checksumming requests need to traverse devices on an XDP redirect.
On 10/6/22 8:29 AM, Jesper Dangaard Brouer wrote:
>
> On 06/10/2022 11.14, Magnus Karlsson wrote:
>> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>>
>>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>>> I'm not too versed in the rx_desc/rx_queue area, but it seems like,
>>>>>> worst case, bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx
>>>>>> and parse it out from the pre-populated metadata?
>>>>>
>>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>>> and container_of() to get to its own stack with whatever fields
>>>>> it needs.
>>>>
>>>> Ack, seems like something worth exploring then.
>>>>
>>>> The only issue I see with that is that we'd probably have to extend
>>>> the loading api to pass the target xdp device so we can pre-generate
>>>> per-device bytecode for those kfuncs?
>>>
>>> There is an existing attr->prog_ifindex for dev offload purposes.
>>> Maybe we can re-purpose/re-use some of the offload API. How this kfunc
>>> can be presented also needs some thought, could be a new ndo_xxx....
>>> not sure.
>>>> And this potentially will block attaching the same program
>>>> to different drivers/devices?
>>>> Or, Martin, did you maybe have something better in mind?
>>>
>>> If the kfunc/helper is inlined, then it will have to be per device.
>>> Unless the bpf prog chooses not to inline, which could be an option,
>>> but I am also not sure how often the user wants to 'attach' a loaded
>>> xdp prog to a different device.
>>> To some extent, the CO-RE hints-loading-code will have to be per
>>> device also, no?
>>>
>>> Why I ask about the kfunc/helper approach is because, from the set, it
>>> seems the hints are already available in the driver. The specific
>>> knowledge the xdp prog is missing is how to get the hints from the
>>> rx_desc/rx_queue. The straightforward way to me is to make them
>>> (rx_desc/rx_queue) available to the xdp prog and have a kfunc/helper
>>> to extract the hints from them only if the xdp prog needs it. The xdp
>>> prog can selectively get what hints it needs and then optionally store
>>> them into the meta area in any layout.
>>
>> This sounds like a really good idea to me, well worth exploring. To
>> only have to pay, performance-wise, for the metadata you actually use
>> is very important. I did some experiments [1] on the previous patch
>> set of Jesper's and there is substantial overhead added for each
>> metadata item enabled (and fetched from the NIC). This is especially
>> important for AF_XDP in zero-copy mode where most packets are directed
>> to user-space (if not, you should be using the regular driver that is
>> optimized for passing packets to the stack or redirecting to other
>> devices). In this case, the user knows exactly what metadata it wants
>> and where in the metadata area it should be located in order to offer
>> the best performance for the application in question. But as you say,
>> your suggestion could potentially offer a good performance upside to
>> the regular XDP path too.

Yeah, since we are on this flexible hint layout: after reading the
replies in the other threads, now I am also not sure why we need an
xdp_hints_common, and probably I am missing something. It seems to be
most useful in __xdp_build_skb_from_frame.
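A minimal sketch of what that on-demand kfunc pattern could look like
from the BPF side (bpf_xdp_get_hwtstamp is only proposed in this
discussion; the name, signature and __ksym declaration here are
assumptions, not an existing kernel API):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Hypothetical driver-backed kfunc that digs the timestamp out of
   * the rx_desc / HW registers only when actually invoked.
   */
  extern __u64 bpf_xdp_get_hwtstamp(struct xdp_md *ctx) __ksym;

  SEC("xdp")
  int tstamp_on_demand(struct xdp_md *ctx)
  {
          __u64 ts = bpf_xdp_get_hwtstamp(ctx);

          if (ts)
                  bpf_printk("hw rx timestamp: %llu", ts);
          return XDP_PASS;
  }

The appeal is that a program that never calls the kfunc pays nothing for
the hint, which matches Magnus' per-metadata overhead observation above.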
However, the xdp prog can also fill in the xdp_hints_common by itself,
only when needed, instead of having the driver always fill it in.

>
> Okay, let's revisit this again. And let me explain why I believe this
> isn't going to fly.
>
> This was also my initial thought: let's just give the XDP BPF-prog
> direct access to the NIC rx_descriptor, or have another BPF-prog
> populate the XDP-hints prior to calling the XDP-prog. Going down this
> path (previously) I learned three things:
>
> (1) Understanding/decoding the rx_descriptor requires access to the
> programmer's datasheet, because it is very compact and the meaning of
> the bits depends on other bits plus the current configuration status of
> the HW.
>
> (2) HW has bugs, and for certain chip revisions the driver will skip
> some offload hints. Thus, chip revisions need to be exported to
> BPF-progs and handled appropriately.
>
> (3) Sometimes the info is actually not available in the rx_descriptor.
> Often for HW timestamps, the timestamp needs to be read from a HW
> register. How do we expose this to the BPF-prog?

hmm.... maybe I am missing those hw-specific details here. How would
the driver handle the above cases and fill in the xdp_hints in the
meta? Can the same code be called by the xdp prog?

>
>> [1]
>> https://lore.kernel.org/bpf/CAJ8uoz1XVqVCpkKo18qbkh6jq_Lejk24OwEWCB9cWhokYLEBDQ@mail.gmail.com/
>
>
> Notice that this patchset doesn't block this idea, as it is orthogonal.
> After we have established a way to express xdp_hints layouts via BTF,
> then we can still add a pre-XDP BPF-prog that populates the XDP-hints,
> and squeeze out more performance by skipping some of the offloads that
> your specific XDP-prog is not interested in.
>
> --Jesper
>
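To illustrate why those three points push the extraction into the
driver, a pseudo-sketch of a driver-side fill function (every identifier
below is made up for illustration; no real driver uses these names):

  /* Sketch: hint extraction needs datasheet knowledge (1), chip
   * revision quirks (2), and HW register reads (3) -- all of which the
   * driver already has, and a generic BPF-prog does not.
   */
  static void drv_fill_xdp_hints(struct drv_ring *ring,
                                 union drv_rx_desc *rx_desc,
                                 struct xdp_buff *xdp)
  {
          u64 qword = le64_to_cpu(rx_desc->wb.qword1);

          /* (1) bit meaning depends on other bits + current HW config */
          if (!(qword & DRV_RXD_TS_VALID) || !ring->hw_tstamp_enabled)
                  return;

          /* (2) certain chip revisions must skip this offload */
          if (ring->hw_rev == DRV_REV_A0)
                  return;

          /* (3) the timestamp itself lives in a HW register */
          drv_hints_set_tstamp(xdp, drv_read_tstamp_reg(ring));
  }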
On 11/10/2022 08.29, Martin KaFai Lau wrote:
> On 10/6/22 8:29 AM, Jesper Dangaard Brouer wrote:
>>
>> On 06/10/2022 11.14, Magnus Karlsson wrote:
>>> On Wed, Oct 5, 2022 at 9:27 PM Martin KaFai Lau
>>> <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/4/22 7:15 PM, Stanislav Fomichev wrote:
>>>>> On Tue, Oct 4, 2022 at 6:24 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>>>
>>>>>> On Tue, 4 Oct 2022 18:02:56 -0700 Stanislav Fomichev wrote:
>>>>>>> +1, sounds like a good alternative (got your reply while typing)
>>>>>>> I'm not too versed in the rx_desc/rx_queue area, but it seems like,
>>>>>>> worst case, bpf_xdp_get_hwtstamp can probably receive a xdp_md ctx
>>>>>>> and parse it out from the pre-populated metadata?
>>>>>>
>>>>>> I'd think so, worst case the driver can put xdp_md into a struct
>>>>>> and container_of() to get to its own stack with whatever fields
>>>>>> it needs.
>>>>>
>>>>> Ack, seems like something worth exploring then.
>>>>>
>>>>> The only issue I see with that is that we'd probably have to extend
>>>>> the loading api to pass the target xdp device so we can pre-generate
>>>>> per-device bytecode for those kfuncs?
>>>>
>>>> There is an existing attr->prog_ifindex for dev offload purposes.
>>>> Maybe we can re-purpose/re-use some of the offload API. How this
>>>> kfunc can be presented also needs some thought, could be a new
>>>> ndo_xxx.... not sure.
>>>>> And this potentially will block attaching the same program
>>>>> to different drivers/devices?
>>>>> Or, Martin, did you maybe have something better in mind?
>>>>
>>>> If the kfunc/helper is inlined, then it will have to be per device.
>>>> Unless the bpf prog chooses not to inline, which could be an option,
>>>> but I am also not sure how often the user wants to 'attach' a loaded
>>>> xdp prog to a different device.
>>>> To some extent, the CO-RE hints-loading-code will have to be per
>>>> device also, no?
>>>>
>>>> Why I ask about the kfunc/helper approach is because, from the set,
>>>> it seems the hints are already available in the driver. The specific
>>>> knowledge the xdp prog is missing is how to get the hints from the
>>>> rx_desc/rx_queue. The straightforward way to me is to make them
>>>> (rx_desc/rx_queue) available to the xdp prog and have a kfunc/helper
>>>> to extract the hints from them only if the xdp prog needs it. The
>>>> xdp prog can selectively get what hints it needs and then optionally
>>>> store them into the meta area in any layout.
>>>
>>> This sounds like a really good idea to me, well worth exploring. To
>>> only have to pay, performance-wise, for the metadata you actually use
>>> is very important. I did some experiments [1] on the previous patch
>>> set of Jesper's and there is substantial overhead added for each
>>> metadata item enabled (and fetched from the NIC). This is especially
>>> important for AF_XDP in zero-copy mode where most packets are directed
>>> to user-space (if not, you should be using the regular driver that is
>>> optimized for passing packets to the stack or redirecting to other
>>> devices). In this case, the user knows exactly what metadata it wants
>>> and where in the metadata area it should be located in order to offer
>>> the best performance for the application in question. But as you say,
>>> your suggestion could potentially offer a good performance upside to
>>> the regular XDP path too.
>
> Yeah, since we are on this flexible hint layout: after reading the
> replies in the other threads, now I am also not sure why we need an
> xdp_hints_common, and probably I am missing something. It seems to
> be most useful in __xdp_build_skb_from_frame. However, the xdp prog can
> also fill in the xdp_hints_common by itself, only when needed, instead
> of having the driver always fill it in.
>

I *want* the XDP-hints to be populated even when no XDP-prog is running.

The xdp_frame *is* the mini-SKB concept. These XDP-hints are about
adding HW offload hints to this mini-SKB, to allow it to grow into a
full SKB with these offloads.

I could add this purely as a netstack feature, via extending the
xdp_frame area with a common struct. For XDP-prog access I could extend
xdp_md with fields that get UAPI-rewrite-mapped onto these fields. For
the AF_XDP users this data becomes harder to access; an XDP-prog could
(spend cycles to) move these offloads into the metadata area, but why
not place them there in the first place?

I think the main point is that I don't see the XDP-prog as the primary
consumer of these hints. One reason/use-case for letting an XDP-prog
access these hints prior to creating a full SKB is to help fix up (or
provide) offload hints. The mvneta driver patch highlights this, as the
HW has limited hints, which an XDP-prog can provide prior to calling
the netstack.

In this patchset I'm trying to balance the different users. And via BTF
I'm trying hard not to create more UAPI (e.g. more fixed fields
available in xdp_md that we cannot get rid of). And trying to add
driver flexibility on top of the common struct. This flexibility seems
to be stalling the patchset, as we haven't found the perfect way to
express this (yet) given the BTF layout is per driver.

>>
>> Okay, let's revisit this again. And let me explain why I believe this
>> isn't going to fly.
>>
>> This was also my initial thought: let's just give the XDP BPF-prog
>> direct access to the NIC rx_descriptor, or have another BPF-prog
>> populate the XDP-hints prior to calling the XDP-prog. Going down this
>> path (previously) I learned three things:
>>
>> (1) Understanding/decoding the rx_descriptor requires access to the
>> programmer's datasheet, because it is very compact and the meaning of
>> the bits depends on other bits plus the current configuration status
>> of the HW.
>>
>> (2) HW has bugs, and for certain chip revisions the driver will skip
>> some offload hints. Thus, chip revisions need to be exported to
>> BPF-progs and handled appropriately.
>>
>> (3) Sometimes the info is actually not available in the rx_descriptor.
>> Often for HW timestamps, the timestamp needs to be read from a HW
>> register. How do we expose this to the BPF-prog?
>
> hmm.... maybe I am missing those hw-specific details here. How would
> the driver handle the above cases and fill in the xdp_hints in the
> meta? Can the same code be called by the xdp prog?
>

As I mentioned above, I want the XDP-hints to be populated even when no
XDP-prog is running. I don't want the dependency on loading an XDP-prog
to get the hints populated, as e.g. the netstack is one of the users.

>>
>> Notice that this patchset doesn't block this idea, as it is orthogonal.
>> After we have established a way to express xdp_hints layouts via BTF,
>> then we can still add a pre-XDP BPF-prog that populates the XDP-hints,
>> and squeeze out more performance by skipping some of the offloads that
>> your specific XDP-prog is not interested in.
>>
>> --Jesper
>>
>
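For a concrete picture of the "netstack as consumer" argument, a sketch
of how the xdp_frame-to-SKB conversion could apply the common hints (the
xdp_hints_common member names and flag bits here are illustrative
assumptions; only the SKB helpers are existing kernel API):

  /* Sketch: applied during __xdp_build_skb_from_frame(), so offloads
   * survive into the SKB even when no XDP-prog was ever attached.
   */
  static void xdp_hints_apply_to_skb(struct sk_buff *skb,
                                     const struct xdp_hints_common *hints)
  {
          if (hints->flags & HINT_F_CSUM_UNNECESSARY)
                  skb->ip_summed = CHECKSUM_UNNECESSARY;

          if (hints->flags & HINT_F_RX_HASH)
                  skb_set_hash(skb, hints->rx_hash32, PKT_HASH_TYPE_L4);

          if (hints->flags & HINT_F_VLAN_TCI)
                  __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
                                         hints->vlan_tci);
  }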