Message ID: cover.1726935917.git.lorenzo@kernel.org (mailing list archive)
Series: Add XDP rx hw hints support performing XDP_REDIRECT
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Sat, 21 Sep 2024 18:52:56 +0200

> This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame

&xdp_buff is on the stack.
&xdp_frame consumes headroom.

IOW they're size-sensitive and putting metadata directly there might
play bad; if not now, then later.

Our idea (me + Toke) was as follows:

- new BPF kfunc to build generic meta. If called, the driver builds a
  generic meta with hash, csum etc., in the data_meta area.
  Yes, this also consumes headroom, but only when the corresponding func
  is called. Introducing new fields like you're doing will consume it
  unconditionally;

- when &xdp_frame gets converted to sk_buff, the function checks whether
  data_meta contains a generic structure filled with hints.

We also thought about &skb_shared_info, but it's also size-sensitive as
it consumes tailroom.

> one as a container to store the already supported xdp rx hw hints (rx_hash
> and rx_vlan, rx_timestamp will be stored in skb_shared_info area) when the
> eBPF program running on the nic performs XDP_REDIRECT. Doing so, we are able
> to set the skb metadata converting the xdp_buff/xdp_frame to a skb.

Thanks,
Olek
On 21/09/2024 22.17, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Sat, 21 Sep 2024 18:52:56 +0200
>
>> This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
>
> &xdp_buff is on the stack.
> &xdp_frame consumes headroom.
>
> IOW they're size-sensitive and putting metadata directly there might
> play bad; if not now, then later.
>
> Our idea (me + Toke) was as follows:
>
> - new BPF kfunc to build generic meta. If called, the driver builds a
>   generic meta with hash, csum etc., in the data_meta area.

I do agree that it should be the XDP prog (via a new BPF kfunc) that
decides if the xdp_frame should be updated to contain a generic meta
struct. *BUT* I think we should use the xdp_frame area, and not the
xdp->data_meta area.

A detail is that I think this kfunc should write data directly into the
xdp_frame area, even when we are only operating on the xdp_buff, as we
do have access to the area the xdp_frame will be created in.

When using the data_meta area, the netstack encap/decap needs to move
the data_meta area (extra cycles). The xdp_frame area (living at the
top) doesn't have this issue.

It is easier to allow the xdp_frame area to survive longer together with
the SKB. Today we "release" this xdp_frame area to be used by the SKB
for extra headroom (see xdp_scrub_frame). I can imagine that we can move
SKB fields to this area, and reduce the size of the SKB alloc. (This
then becomes the mini-SKB we discussed a couple of years ago.)

> Yes, this also consumes headroom, but only when the corresponding func
> is called. Introducing new fields like you're doing will consume it
> unconditionally;

We agree that the kfunc call marks the area as consumed/in-use. We can
extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
but xdp_frame->flags can be used for marking this area as used or not.

> - when &xdp_frame gets converted to sk_buff, the function checks whether
>   data_meta contains a generic structure filled with hints.

Agree, but take the data from xdp_frame->xdp_rx_meta.

When XDP returns XDP_PASS, then I also want to see this data applied to
the SKB. In patchset[1] Yan called this xdp_frame_fixup_skb_offloading()
and xdp_buff_fixup_skb_offloading(). (Perhaps "fixup" isn't the right
term; "apply" is perhaps better.) Having this generic name allows us to
extend with newer offloads, and eventually move members out of the SKB.

We called it "fixup" because our use-case is that our XDP load-balancer
(Unimog) XDP_TX-bounces packets with a GRE header encap, and on the
receiving NIC (due to encap) we lose the HW hash/csum, which we want to
transfer from the original NIC: decap in XDP and apply the original HW
hash/csum via this "fixup" call.

--Jesper

[1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/

> We also thought about &skb_shared_info, but it's also size-sensitive as
> it consumes tailroom.
>
>> one as a container to store the already supported xdp rx hw hints (rx_hash
>> and rx_vlan, rx_timestamp will be stored in skb_shared_info area) when the
>> eBPF program running on the nic performs XDP_REDIRECT. Doing so, we are able
>> to set the skb metadata converting the xdp_buff/xdp_frame to a skb.
>
> Thanks,
> Olek
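A rough sketch of the apply/fixup step Jesper describes, done at
xdp_frame -> sk_buff conversion time. The flag, the rx_meta layout, and
the function name here are invented for illustration; the series and
Yan's patchset define the real ones:

static void xdp_frame_apply_rx_meta(const struct xdp_frame *frm,
				    struct sk_buff *skb)
{
	/* only touch the area if a kfunc marked it as populated */
	if (!(frm->flags & XDP_FLAGS_RX_META))	/* hypothetical flag */
		return;

	/* hash type simplified here; a real version would derive it
	 * from the stored rss hash type
	 */
	skb_set_hash(skb, frm->rx_meta.hash, PKT_HASH_TYPE_L4);
	if (frm->rx_meta.vlan_proto)
		__vlan_hwaccel_put_tag(skb, frm->rx_meta.vlan_proto,
				       frm->rx_meta.vlan_tci);
}

Nothing here survives headroom reuse, so such a conversion path would
have to apply the hints before "releasing" the xdp_frame area as
described above.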
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Sat, 21 Sep 2024 18:52:56 +0200
>
> > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
>
> &xdp_buff is on the stack.
> &xdp_frame consumes headroom.

ack, right.

> IOW they're size-sensitive and putting metadata directly there might
> play bad; if not now, then later.

I was thinking of using a TLV approach for it (so a variable struct),
but then I decided to implement the simplest solution for the moment
since, using TLV, we would need to add parsing logic and waste at least
2B for each meta info to store the type and length. Moreover, with XDP
we have 256B available for headroom, and for xdp_frame we would use the
same cacheline as the current implementation:

struct xdp_frame {
	void *                     data;                 /*     0     8 */
	u16                        len;                  /*     8     2 */
	u16                        headroom;             /*    10     2 */
	u32                        metasize;             /*    12     4 */
	struct xdp_mem_info        mem;                  /*    16     8 */
	struct net_device *        dev_rx;               /*    24     8 */
	u32                        frame_sz;             /*    32     4 */
	u32                        flags;                /*    36     4 */
	struct xdp_rx_meta         rx_meta;              /*    40    12 */

	/* size: 56, cachelines: 1, members: 9 */
	/* padding: 4 */
	/* last cacheline: 56 bytes */
};

Anyway, I do not have a strong opinion about it and I am fine to convert
the current implementation to a TLV one if we agree on it.

> Our idea (me + Toke) was as follows:
>
> - new BPF kfunc to build generic meta. If called, the driver builds a
>   generic meta with hash, csum etc., in the data_meta area.
>   Yes, this also consumes headroom, but only when the corresponding func
>   is called. Introducing new fields like you're doing will consume it
>   unconditionally;

ack, I am currently reusing the kfuncs added by Stanislav but I agree it
is better to add a new one to store the rx hw hints info, I will work on
it.

> - when &xdp_frame gets converted to sk_buff, the function checks whether
>   data_meta contains a generic structure filled with hints.
>
> We also thought about &skb_shared_info, but it's also size-sensitive as
> it consumes tailroom.

for rx_timestamp we can reuse the field available in the
skb_shared_info.

Regards,
Lorenzo

> > one as a container to store the already supported xdp rx hw hints (rx_hash
> > and rx_vlan, rx_timestamp will be stored in skb_shared_info area) when the
> > eBPF program running on the nic performs XDP_REDIRECT. Doing so, we are able
> > to set the skb metadata converting the xdp_buff/xdp_frame to a skb.
>
> Thanks,
> Olek
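For reference, a 12-byte layout consistent with the pahole output above
could look like the following (a guess for illustration only; the actual
member names and types are whatever the series defines):

struct xdp_rx_meta {
	u32	hash;		/* RSS hash value from the NIC */
	u32	hash_type;	/* enum xdp_rss_hash_type */
	u16	vlan_tci;	/* VLAN tag control information */
	__be16	vlan_proto;	/* VLAN ethertype (hwaccel) */
};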
> On 21/09/2024 22.17, Alexander Lobakin wrote:
> > From: Lorenzo Bianconi <lorenzo@kernel.org>
> > Date: Sat, 21 Sep 2024 18:52:56 +0200
> >
> > > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
> >
> > &xdp_buff is on the stack.
> > &xdp_frame consumes headroom.
> >
> > IOW they're size-sensitive and putting metadata directly there might
> > play bad; if not now, then later.
> >
> > Our idea (me + Toke) was as follows:
> >
> > - new BPF kfunc to build generic meta. If called, the driver builds a
> >   generic meta with hash, csum etc., in the data_meta area.
>
> I do agree that it should be the XDP prog (via a new BPF kfunc) that
> decides if the xdp_frame should be updated to contain a generic meta
> struct. *BUT* I think we should use the xdp_frame area, and not the
> xdp->data_meta area.

ack, I will add a new kfunc for it.

> A detail is that I think this kfunc should write data directly into the
> xdp_frame area, even when we are only operating on the xdp_buff, as we
> do have access to the area the xdp_frame will be created in.

this would avoid copying it when we convert from xdp_buff to xdp_frame,
nice :)

> When using the data_meta area, the netstack encap/decap needs to move
> the data_meta area (extra cycles). The xdp_frame area (living at the
> top) doesn't have this issue.
>
> It is easier to allow the xdp_frame area to survive longer together with
> the SKB. Today we "release" this xdp_frame area to be used by the SKB
> for extra headroom (see xdp_scrub_frame). I can imagine that we can move
> SKB fields to this area, and reduce the size of the SKB alloc. (This
> then becomes the mini-SKB we discussed a couple of years ago.)
>
> > Yes, this also consumes headroom, but only when the corresponding func
> > is called. Introducing new fields like you're doing will consume it
> > unconditionally;
>
> We agree that the kfunc call marks the area as consumed/in-use. We can
> extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
> but xdp_frame->flags can be used for marking this area as used or not.

the only downside with respect to a TLV approach would be to consume the
whole xdp_rx_meta as soon as we set a single xdp rx hw hint in it,
right? The upside is it is easier and requires fewer instructions.

> > - when &xdp_frame gets converted to sk_buff, the function checks whether
> >   data_meta contains a generic structure filled with hints.
>
> Agree, but take the data from xdp_frame->xdp_rx_meta.
>
> When XDP returns XDP_PASS, then I also want to see this data applied to
> the SKB. In patchset[1] Yan called this xdp_frame_fixup_skb_offloading()
> and xdp_buff_fixup_skb_offloading(). (Perhaps "fixup" isn't the right
> term; "apply" is perhaps better.) Having this generic name allows us to
> extend with newer offloads, and eventually move members out of the SKB.
>
> We called it "fixup" because our use-case is that our XDP load-balancer
> (Unimog) XDP_TX-bounces packets with a GRE header encap, and on the
> receiving NIC (due to encap) we lose the HW hash/csum, which we want to
> transfer from the original NIC: decap in XDP and apply the original HW
> hash/csum via this "fixup" call.

I already set the skb metadata when converting the xdp_frame into a skb
in __xdp_build_skb_from_frame(), but I can collect all this logic in a
single routine.

Regards,
Lorenzo

> --Jesper
>
> [1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/
>
> > We also thought about &skb_shared_info, but it's also size-sensitive as
> > it consumes tailroom.
> >
> > > one as a container to store the already supported xdp rx hw hints (rx_hash
> > > and rx_vlan, rx_timestamp will be stored in skb_shared_info area) when the
> > > eBPF program running on the nic performs XDP_REDIRECT. Doing so, we are able
> > > to set the skb metadata converting the xdp_buff/xdp_frame to a skb.
> >
> > Thanks,
> > Olek
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> [... earlier exchange with Alexander and Jesper snipped ...]
>>
>> We agree that the kfunc call marks the area as consumed/in-use. We can
>> extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
>> but xdp_frame->flags can be used for marking this area as used or not.
>
> the only downside with respect to a TLV approach would be to consume the
> whole xdp_rx_meta as soon as we set a single xdp rx hw hint in it,
> right? The upside is it is easier and requires fewer instructions.

FYI, we also had a discussion related to this at LPC on Friday, in this
session: https://lpc.events/event/18/contributions/1935/

The context here was that Arthur and Jakub want to also support extended
rich metadata all the way through the SKB path, and are looking at the
same area used for XDP metadata to store it. So there's a need to manage
both the kernel's own usage of that area, and userspace/BPF usage of it.

I'll try to summarise some of the points of that discussion (all
interpretations are my own, of course):

- We want something that can be carried with a frame all the way from
  the XDP layer, through all SKB layers and to userspace (to replace the
  use of skb->mark for this purpose).

- We want different applications running on the system (of which the
  kernel itself is one, cf this discussion) to be able to share this
  field, without having to have an out-of-band registry (like a GitHub
  repository where applications can agree on which bits to use). Which
  probably means that the kernel needs to be in the loop somehow to
  explicitly allocate space in the metadata area and track offsets.

- Having an explicit API to access this from userspace, without having
  to go through BPF (i.e., a socket- or CMSG-based API) would be useful.

The TLV format was one of the suggestions in Arthur and Jakub's talk,
but AFAICT, there was not a lot of enthusiasm about this in the room
(myself included), because of the parsing overhead and complexity. I
believe the alternative that was seen as most favourable was a map
lookup-style API, where applications can request a metadata area of
arbitrary size and get an ID assigned that they can then use to set/get
values in the data path.

So, sketching this out, this could be realised by something like:

/* could be called from BPF, or through netlink or sysfs; may fail, if
 * there is no more space
 */
int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));

The ID is just an opaque identifier that can then be passed to
getter/setter functions (for both SKB and XDP), like:

ret = bpf_set_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))

ret = bpf_get_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))

On the kernel side, the implementation would track registered fields in
a global structure somewhere, say:

struct pkt_metadata_entry {
	int id;
	u8 sz;
	u8 offset;
	u8 bit;
};

struct pkt_metadata_registry { /* allocated as a system-wide global */
	u8 num_entries;
	u8 total_size;
	struct pkt_metadata_entry entries[MAX_ENTRIES];
};

struct xdp_rx_meta { /* at the end of xdp_frame */
	u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
	u8 fields_set; /* bitmap of fields that have been set, see below */
	u8 data[];
};

int register_packet_metadata_field(u8 size)
{
	struct pkt_metadata_registry *reg = get_global_registry();
	struct pkt_metadata_entry *entry;

	if (size + reg->total_size > MAX_METADATA_SIZE)
		return -ENOSPC;

	entry = &reg->entries[reg->num_entries++];
	entry->id = assign_id();
	entry->sz = size;
	entry->offset = reg->total_size;
	entry->bit = reg->num_entries - 1;
	reg->total_size += size;

	return entry->id;
}

int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id,
				  void *value, size_t sz)
{
	struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

	if (!entry)
		return -ENOENT;

	if (entry->sz != sz)
		return -EINVAL; /* user error */

	if (frm->rx_meta.sz < entry->offset + sz)
		return -EFAULT; /* entry allocated after xdp_frame was initialised */

	memcpy(frm->rx_meta.data + entry->offset, value, sz);
	frm->rx_meta.fields_set |= BIT(entry->bit);

	return 0;
}

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id,
				  void *value, size_t sz)
{
	struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

	if (!entry)
		return -ENOENT;

	if (entry->sz != sz)
		return -EINVAL;

	if (frm->rx_meta.sz < entry->offset + sz)
		return -EFAULT; /* entry allocated after xdp_frame was initialised */

	if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
		return -ENOENT;

	memcpy(value, frm->rx_meta.data + entry->offset, sz);

	return 0;
}

I'm hinting at some complications here (with the EFAULT return) that
need to be resolved: there is no guarantee that a given packet will be
in sync with the current status of the registered metadata, so we need
explicit checks for this. If metadata entries are de-registered again,
this also means dealing with holes and/or reshuffling the metadata
layout to reuse the released space (incidentally, this is the one place
where a TLV format would have advantages).

The nice thing about an API like this, though, is that it's extensible,
and the kernel itself can be just another consumer of it for the
metadata fields Lorenzo is adding in this series. I.e., we could just
pre-define some IDs for metadata vlan, timestamp etc., and use the same
functions as above from within the kernel to set and get those values;
using the registry, there could even be an option to turn those off if
an application wants more space for its own usage. Or, alternatively, we
could keep the kernel-internal IDs hardcoded and always allocated, and
just use the getter/setter functions as the BPF API for accessing them.

-Toke
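To make the "kernel as just another consumer" idea concrete, the
xdp_frame -> sk_buff conversion could pull the HW hints back out through
the very same getter. A sketch against the API above, where
METADATA_ID_HW_HASH is a hypothetical pre-allocated kernel ID:

static void xdp_frame_apply_hw_hints(struct xdp_frame *frm,
				     struct sk_buff *skb)
{
	u32 hash;

	/* -ENOENT if neither driver nor XDP prog ever set the field */
	if (!bpf_get_packet_metadata_field(frm, METADATA_ID_HW_HASH,
					   &hash, sizeof(hash)))
		skb_set_hash(skb, hash, PKT_HASH_TYPE_L4); /* type simplified */
}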
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
> [... earlier exchange snipped ...]
>
> FYI, we also had a discussion related to this at LPC on Friday, in this
> session: https://lpc.events/event/18/contributions/1935/

Hi Toke,

thx for the pointer

> The context here was that Arthur and Jakub want to also support extended
> rich metadata all the way through the SKB path, and are looking at the
> same area used for XDP metadata to store it. So there's a need to manage
> both the kernel's own usage of that area, and userspace/BPF usage of it.

it would be cool if we can collaborate on it.

> I'll try to summarise some of the points of that discussion (all
> interpretations are my own, of course):
>
> [... summary and API sketch snipped ...]
>
> I'm hinting at some complications here (with the EFAULT return) that
> need to be resolved: there is no guarantee that a given packet will be
> in sync with the current status of the registered metadata, so we need
> explicit checks for this. If metadata entries are de-registered again,
> this also means dealing with holes and/or reshuffling the metadata
> layout to reuse the released space (incidentally, this is the one place
> where a TLV format would have advantages).

I like this approach, but it seems to me more suitable for 'sw' metadata
(this is the main Arthur and Jakub use case iiuc), where userspace would
enable/disable these functionalities system-wide.
Regarding device hw metadata (e.g. checksum offload) I can see some
issues, since on a system we can have multiple NICs with different
capabilities. If we consider the current codebase, the stmmac driver
supports only rx timestamp, while mlx5 supports all of them. In a
theoretical system with these two NICs, since pkt_metadata_registry is
global system-wide, we will end up with quite a lot of holes for the
stmmac, right? (I am not sure if this case is relevant or not.) In other
words, we will end up with a fixed struct for device rx hw metadata
(like xdp_rx_meta). So I am wondering if we really need all this
complexity for xdp rx hw metadata?
Maybe we can start with a simple approach for xdp rx hw metadata,
putting the struct in xdp_frame as suggested by Jesper and covering the
most common use-cases. We can then integrate this approach with
Arthur/Jakub's solution without introducing any backward-compatibility
issue, since these fields are not visible to userspace.

Regards,
Lorenzo

> The nice thing about an API like this, though, is that it's extensible,
> and the kernel itself can be just another consumer of it for the
> metadata fields Lorenzo is adding in this series.
>
> [...]
>
> -Toke
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> I'm hinting at some complications here (with the EFAULT return) that
>> need to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again,
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>
> I like this approach, but it seems to me more suitable for 'sw' metadata
> (this is the main Arthur and Jakub use case iiuc), where userspace would
> enable/disable these functionalities system-wide.
> Regarding device hw metadata (e.g. checksum offload) I can see some
> issues, since on a system we can have multiple NICs with different
> capabilities. If we consider the current codebase, the stmmac driver
> supports only rx timestamp, while mlx5 supports all of them. In a
> theoretical system with these two NICs, since pkt_metadata_registry is
> global system-wide, we will end up with quite a lot of holes for the
> stmmac, right? (I am not sure if this case is relevant or not.) In other
> words, we will end up with a fixed struct for device rx hw metadata
> (like xdp_rx_meta). So I am wondering if we really need all this
> complexity for xdp rx hw metadata?

Well, the "holes" will be there anyway (in your static struct approach).
They would just correspond to parts of the "struct xdp_rx_meta" being
unset.

What the "userspace can turn off the fields system-wide" would
accomplish is to *avoid* the holes if you know that you will never need
them. E.g., say a system administrator knows that they have no networks
that use (offloaded) VLANs. They could then disable the vlan offload
field system-wide, and thus reclaim the four bytes taken up by the
"vlan" member of struct xdp_rx_meta, freeing that up for use by
application metadata.

However, it may well be that the complexity of allowing fields to be
turned off is not worth the gains. At least as long as there are only
the couple of HW metadata fields that we have currently. Having the
flexibility to change our minds later would be good, though, which is
mostly a matter of making the API exposed to BPF and/or userspace
flexible enough to allow us to move things around in memory in the
future. Which was basically my thought with the API I sketched out in
the previous email. I.e., you could go:

ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
                                    &my_hash_value, sizeof(u32))

...and the METADATA_ID_HW_HASH would be a constant defined by the
kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
the field in the future, we just change the kfunc implementation, with
no impact on the BPF programs calling it.

> Maybe we can start with a simple approach for xdp rx hw metadata,
> putting the struct in xdp_frame as suggested by Jesper and covering the
> most common use-cases. We can then integrate this approach with
> Arthur/Jakub's solution without introducing any backward-compatibility
> issue, since these fields are not visible to userspace.

Yes, this is basically the gist of my suggestion (as I hopefully managed
to clarify above): Expose the fields via an API that is flexible enough
that we can move things around should the need arise, *and* which can
co-exist with the user-defined application metadata.

-Toke
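A sketch of such a hardcoded lookup (illustrative only; the point is
that BPF callers see a stable constant while the kernel stays free to
relocate the backing field):

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id,
				  void *value, size_t sz)
{
	switch (id) {
	case METADATA_ID_HW_HASH:
		if (sz != sizeof(frm->rx_meta.hash))
			return -EINVAL;
		memcpy(value, &frm->rx_meta.hash, sz);
		return 0;
	default:
		/* dynamically registered application fields */
		return get_registered_field(frm, id, value, sz); /* hypothetical */
	}
}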
On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> FYI, we also had a discussion related to this at LPC on Friday, in this
> session: https://lpc.events/event/18/contributions/1935/
>
> [... summary of the LPC discussion snipped ...]

Thanks for looping us in, and the great summary Toke!

> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> but AFAICT, there was not a lot of enthusiasm about this in the room
> (myself included), because of the parsing overhead and complexity. I
> believe the alternative that was seen as most favourable was a map
> lookup-style API, where applications can request a metadata area of
> arbitrary size and get an ID assigned that they can then use to set/get
> values in the data path.
>
> [... API sketch snipped ...]
>
> Or, alternatively, we could keep the kernel-internal IDs hardcoded and
> always allocated, and just use the getter/setter functions as the BPF
> API for accessing them.

That's exactly what I'm thinking of too, a simple API like:

get(u8 key, u8 len, void *val);
set(u8 key, u8 len, void *val);

With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.

If a NIC doesn't support a certain well-known metadata, the key wouldn't
be set, and get() would return ENOENT.

I think this also lets us avoid having to "register" keys or bits of
metadata with the kernel.
We'd reserve some number of keys for hardware metadata. The remaining
keys would be up to users. They'd have to allocate keys to services, and
configure services to use those keys.
This is similar to the way listening on a certain port works: only one
service can use port 80 or 443, and that can typically be configured in
a service's config file.

This side-steps the whole question of how to change the registered
metadata for in-flight packets, and how to deal with different NICs with
different hardware metadata.

I think I've figured out a suitable encoding format, hopefully we'll
have an RFC soon!

> -Toke
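For illustration, usage of such a key split could look like this (a
sketch only; the constants, the key partitioning, and the exact
get()/set() signatures are all still open, and a real API would
presumably also take a packet or context argument):

#define METADATA_ID_HW_HASH	0	/* well-known key, reserved for the kernel */
#define MY_SERVICE_PKT_ID	16	/* free key picked in this deployment's config */

void example(void)
{
	uint32_t hash, pkt_id = 42;

	/* well-known key: -ENOENT if the NIC didn't provide a hash */
	if (get(METADATA_ID_HW_HASH, sizeof(hash), &hash) == -ENOENT)
		hash = 0;

	/* application-defined field, shared by agreement on key 16 */
	set(MY_SERVICE_PKT_ID, sizeof(pkt_id), &pkt_id);
}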
Arthur Fabre <afabre@cloudflare.com> writes:

> On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> [... summary snipped ...]
>
> Thanks for looping us in, and the great summary Toke!

You're welcome :)

>> [... API sketch snipped ...]
>>
>> Or, alternatively, we could keep the kernel-internal IDs hardcoded and
>> always allocated, and just use the getter/setter functions as the BPF
>> API for accessing them.
>
> That's exactly what I'm thinking of too, a simple API like:
>
> get(u8 key, u8 len, void *val);
> set(u8 key, u8 len, void *val);
>
> With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>
> If a NIC doesn't support a certain well-known metadata, the key wouldn't
> be set, and get() would return ENOENT.
>
> I think this also lets us avoid having to "register" keys or bits of
> metadata with the kernel.
> We'd reserve some number of keys for hardware metadata.

Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...

> The remaining keys would be up to users. They'd have to allocate keys
> to services, and configure services to use those keys.
> This is similar to the way listening on a certain port works: only one
> service can use port 80 or 443, and that can typically be configured in
> a service's config file.

Right, well, port numbers *do* actually have an out-of-band service
registry (IANA), which I thought was what we wanted to avoid? ;)

> This side-steps the whole question of how to change the registered
> metadata for in-flight packets, and how to deal with different NICs
> with different hardware metadata.
>
> I think I've figured out a suitable encoding format, hopefully we'll
> have an RFC soon!

Alright, cool!

-Toke
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
> [...]
>
> Well, the "holes" will be there anyway (in your static struct approach).
> They would just correspond to parts of the "struct xdp_rx_meta" being
> unset.

yes, what I wanted to say is that I have the feeling we will end up 90%
of the time with the same fields architecture, and the cases where we
can save some space seem very limited. Anyway, I am fine discussing a
common architecture.

> What the "userspace can turn off the fields system-wide" would
> accomplish is to *avoid* the holes if you know that you will never need
> them. E.g., say a system administrator knows that they have no networks
> that use (offloaded) VLANs. They could then disable the vlan offload
> field system-wide, and thus reclaim the four bytes taken up by the
> "vlan" member of struct xdp_rx_meta, freeing that up for use by
> application metadata.

Even if I like the idea of having a common approach for this kernel
feature, hw metadata seems to me quite a corner case with respect to
'user-defined metadata', since:
- I do not think it is a common scenario to disable hw offload
  capabilities (e.g. checksum offload) in production
- I guess it is not enough to just disable them via bpf; the
  user/sysadmin will even need to configure the NIC via ethtool (so a
  2-step process).

I think we should pay attention not to overcomplicate something that is
99% enabled and just needs to be fast. E.g. I can see an issue with
putting the hw rx metadata in the metadata field, since metadata grows
backward and we will probably end up putting it in a different cacheline
with respect to xdp_frame (xdp headroom is usually 256B).

> However, it may well be that the complexity of allowing fields to be
> turned off is not worth the gains. At least as long as there are only
> the couple of HW metadata fields that we have currently. Having the
> flexibility to change our minds later would be good, though, which is
> mostly a matter of making the API exposed to BPF and/or userspace
> flexible enough to allow us to move things around in memory in the
> future. Which was basically my thought with the API I sketched out in
> the previous email. I.e., you could go:
>
> ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
>                                     &my_hash_value, sizeof(u32))

yes, my plan is to add dedicated bpf kfuncs to store hw metadata in
xdp_frame/xdp_buff

> ...and the METADATA_ID_HW_HASH would be a constant defined by the
> kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
> hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
> the field in the future, we just change the kfunc implementation, with
> no impact on the BPF programs calling it.

maybe we can use what Stanislav has already added (maybe removing the
xdp prefix):

enum xdp_rx_metadata {
	XDP_METADATA_KFUNC_RX_TIMESTAMP,
	XDP_METADATA_KFUNC_RX_HASH,
	XDP_METADATA_KFUNC_RX_VLAN_TAG,
};

> > Maybe we can start with a simple approach for xdp rx hw metadata,
> > putting the struct in xdp_frame as suggested by Jesper and covering the
> > most common use-cases. We can then integrate this approach with
> > Arthur/Jakub's solution without introducing any backward-compatibility
> > issue, since these fields are not visible to userspace.
>
> Yes, this is basically the gist of my suggestion (as I hopefully managed
> to clarify above): Expose the fields via an API that is flexible enough
> that we can move things around should the need arise, *and* which can
> co-exist with the user-defined application metadata.

ack

Regards,
Lorenzo

> -Toke
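For context, the existing kfuncs read the hints from the driver at XDP
run time; a minimal BPF-side caller looks roughly like this (the
declaration matches the in-tree xdp rx-metadata interface; the new part
discussed here would be a kfunc that additionally stores the result in
the xdp_buff/xdp_frame):

/* in a typical build, types come from vmlinux.h plus bpf/bpf_helpers.h */
extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int rx_hints(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* 0 on success, -EOPNOTSUPP if the driver doesn't implement it */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
		bpf_printk("rx hash %x type %d", hash, rss_type);

	return XDP_PASS;
}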
On Thu Sep 26, 2024 at 2:41 PM CEST, Toke Høiland-Jørgensen wrote: > Arthur Fabre <afabre@cloudflare.com> writes: > > > On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> FYI, we also had a discussion related to this at LPC on Friday, in this > >> session: https://lpc.events/event/18/contributions/1935/ > >> > >> The context here was that Arthur and Jakub want to also support extended > >> rich metadata all the way through the SKB path, and are looking at the > >> same area used for XDP metadata to store it. So there's a need to manage > >> both the kernel's own usage of that area, and userspace/BPF usage of it. > >> > >> I'll try to summarise some of the points of that discussion (all > >> interpretations are my own, of course): > >> > >> - We want something that can be carried with a frame all the way from > >> the XDP layer, through all SKB layers and to userspace (to replace the > >> use of skb->mark for this purpose). > >> > >> - We want different applications running on the system (of which the > >> kernel itself if one, cf this discussion) to be able to share this > >> field, without having to have an out of band registry (like a Github > >> repository where applications can agree on which bits to use). Which > >> probably means that the kernel needs to be in the loop somehow to > >> explicitly allocate space in the metadata area and track offsets. > >> > >> - Having an explicit API to access this from userspace, without having > >> to go through BPF (i.e., a socket- or CMSG-based API) would be useful. > >> > > > > Thanks for looping us in, and the great summary Toke! > > You're welcome :) > > >> The TLV format was one of the suggestions in Arthur and Jakub's talk, > >> but AFAICT, there was not a lot of enthusiasm about this in the room > >> (myself included), because of the parsing overhead and complexity. I > >> believe the alternative that was seen as most favourable was a map > >> lookup-style API, where applications can request a metadata area of > >> arbitrary size and get an ID assigned that they can then use to set/get > >> values in the data path. 
> >> > >> So, sketching this out, this could be realised by something like: > >> > >> /* could be called from BPF, or through netlink or sysfs; may fail, if > >> * there is no more space > >> */ > >> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta)); > >> > >> The ID is just an opaque identifier that can then be passed to > >> getter/setter functions (for both SKB and XDP), like: > >> > >> ret = bpf_set_packet_metadata_field(pkt, metadata_id, > >> &my_meta_value, sizeof(my_meta_value)) > >> > >> ret = bpf_get_packet_metadata_field(pkt, metadata_id, > >> &my_meta_value, sizeof(my_meta_value)) > >> > >> > >> On the kernel side, the implementation would track registered fields in > >> a global structure somewhere, say: > >> > >> struct pkt_metadata_entry { > >> int id; > >> u8 sz; > >> u8 offset; > >> u8 bit; > >> }; > >> > >> struct pkt_metadata_registry { /* allocated as a system-wide global */ > >> u8 num_entries; > >> u8 total_size; > >> struct pkt_metadata_entry entries[MAX_ENTRIES]; > >> }; > >> > >> struct xdp_rx_meta { /* at then end of xdp_frame */ > >> u8 sz; /* set to pkt_metadata_registry->total_size on alloc */ > >> u8 fields_set; /* bitmap of fields that have been set, see below */ > >> u8 data[]; > >> }; > >> > >> int register_packet_metadata_field(u8 size) { > >> struct pkt_metadata_registry *reg = get_global_registry(); > >> struct pkt_metadata_entry *entry; > >> > >> if (size + reg->total_size > MAX_METADATA_SIZE) > >> return -ENOSPC; > >> > >> entry = ®->entries[reg->num_entries++]; > >> entry->id = assign_id(); > >> entry->sz = size; > >> entry->offset = reg->total_size; > >> entry->bit = reg->num_entries - 1; > >> reg->total_size += size; > >> > >> return entry->id; > >> } > >> > >> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void > >> *value, size_t sz) > >> { > >> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id); > >> > >> if (!entry) > >> return -ENOENT; > >> > >> if (entry->sz != sz) > >> return -EINVAL; /* user error */ > >> > >> if (frm->rx_meta.sz < entry->offset + sz) > >> return -EFAULT; /* entry allocated after xdp_frame was initialised */ > >> > >> memcpy(&frm->rx_meta.data + entry->offset, value, sz); > >> frm->rx_meta.fields_set |= BIT(entry->bit); > >> > >> return 0; > >> } > >> > >> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void > >> *value, size_t sz) > >> { > >> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id); > >> > >> if (!entry) > >> return -ENOENT; > >> > >> if (entry->sz != sz) > >> return -EINVAL; > >> > >> if (frm->rx_meta.sz < entry->offset + sz) > >> return -EFAULT; /* entry allocated after xdp_frame was initialised */ > >> > >> if (!(frm->rx_meta.fields_set & BIT(entry->bit))) > >> return -ENOENT; > >> > >> memcpy(value, &frm->rx_meta.data + entry->offset, sz); > >> > >> return 0; > >> } > >> > >> I'm hinting at some complications here (with the EFAULT return) that > >> needs to be resolved: there is no guarantee that a given packet will be > >> in sync with the current status of the registered metadata, so we need > >> explicit checks for this. If metadata entries are de-registered again > >> this also means dealing with holes and/or reshuffling the metadata > >> layout to reuse the released space (incidentally, this is the one place > >> where a TLV format would have advantages). 
> >> > >> The nice thing about an API like this, though, is that it's extensible, > >> and the kernel itself can be just another consumer of it for the > >> metadata fields Lorenzo is adding in this series. I.e., we could just > >> pre-define some IDs for metadata vlan, timestamp etc, and use the same > >> functions as above from within the kernel to set and get those values; > >> using the registry, there could even be an option to turn those off if > >> an application wants more space for its own usage. Or, alternatively, we > >> could keep the kernel-internal IDs hardcoded and always allocated, and > >> just use the getter/setter functions as the BPF API for accessing them. > > > > That's exactly what I'm thinking of too, a simple API like: > > > > get(u8 key, u8 len, void *val); > > set(u8 key, u8 len, void *val); > > > > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata. > > > > If a NIC doesn't support a certain well-known metadata, the key > > wouldn't be set, and get() would return ENOENT. > > > > I think this also lets us avoid having to "register" keys or bits of > > metadata with the kernel. > > We'd reserve some number of keys for hardware metadata. > > Right, but how do you allocate space/offset for each key without an > explicit allocation step? You'd basically have to encode the list of IDs > in the metadata area itself, which implies a TLV format that you have to > walk on every access? The registry idea in my example above was > basically to avoid that... I've been playing around with having a small fixed header at the front of the metadata itself, that lets you access values without walking them all. Still WIP, and maybe this is too restrictive, but it lets you encode 64 2, 4, or 8 byte KVs with a single 16 byte header: #include <stdio.h> #include <stdint.h> #include <stdbool.h> #include <errno.h> #include <string.h> /** * Very limited KV support: * * - Keys: 0-63. (TBD which are reserved for the kernel) * - Value size: 2, 4, or 8. * * But compact, uses a fixed 16 byte header only. * Reasonably fast access. * Insertion and deletion need a memmove (entries have to be sorted), but it's ~100s of bytes of max. * * TODO - should we support more sizes? Can switch to three bits per entry. * TODO - should we support a 0 size to just the presence / absence of key without a value? */ bool valid_len(uint8_t len) { return len == 2 || len == 4 || len == 8; } bool valid_key(uint8_t key) { return key < 64; } // Fixed header at the start of the meta area. struct hdr { // 2 bit length is stored for each key: // a high bit in the high word. // a low bit in the low word. // Key itself is the bit position in each word, LSb 0. // This lets us count the bits in high and low to easily // calculate the sum of the preceeding KVs lengths. uint64_t high; uint64_t low; }; int total_length(struct hdr h) { // TODO - is this builtin allowed in kernel code? return (__builtin_popcountll(h.high) << 2) + (__builtin_popcountll(h.low) << 1); } struct hdr and(struct hdr h, uint64_t mask) { return (struct hdr){ h.high & mask, h.low & mask, }; } int offset(struct hdr h, uint8_t key) { // Calculate total length of previous keys by masking out keys after. return sizeof(struct hdr) + total_length(and(h, ~(~0llu << key))); } // TODO - is headroom zero initialized? 
#define META_LEN (sizeof(struct hdr) + 128)

uint8_t meta[META_LEN];

int set(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Key already set: caller has to del() it first. Blindly inserting
    // again would leave a stale copy behind and corrupt the offsets of
    // later keys.
    if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
        return -EEXIST;
    }

    // Figure out if we have enough room left: total length of everything now.
    if (sizeof(struct hdr) + total_length(*h) + len > sizeof(meta)) {
        return -ENOMEM;
    }

    // Offset of value of this key.
    int off = offset(*h, key);

    // Memmove all the kvs after us over (but not past the end of the buffer).
    memmove(meta+off+len, meta+off, sizeof(meta)-off-len);

    // Set our value.
    memcpy(meta+off, val, len);

    // Store our length in header.
    uint64_t encode_len = 0;
    switch (len) {
    case 2:
        encode_len = 1;
        break;
    case 4:
        encode_len = 2;
        break;
    case 8:
        encode_len = 3;
        break;
    }
    h->high |= (encode_len >> 1) << key;
    h->low |= (encode_len & 1) << key;

    return 0;
}

// Callers need to know the format ahead of time,
// so they'll know the length too.
// We just check the buffer is big enough for the value.
int get(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr h = *(struct hdr *)meta;

    // Check key is set.
    if (!((h.high & (1ull << key)) || (h.low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset of value of this key.
    int off = offset(h, key);

    // Figure out our length.
    int real_len = total_length(and(h, (1ull << key)));

    if (real_len > len) {
        return -EFBIG;
    }

    memcpy(val, meta+off, real_len);
    return 0;
}

int del(uint8_t key) {
    if (!valid_key(key)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Check key is set.
    if (!((h->high & (1ull << key)) || (h->low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset and length of value of this key.
    int off = offset(*h, key);
    int len = total_length(and(*h, (1ull << key)));

    // Memmove all the kvs after us over.
    memmove(meta+off, meta+off+len, sizeof(meta)-off-len);

    // Clear our length in header.
    h->high &= ~(1ull << key);
    h->low &= ~(1ull << key);

    return 0;
}

> > The remaining keys would be up to users. They'd have to allocate keys
> > to services, and configure services to use those keys.
> > This is similar to the way listening on a certain port works: only one
> > service can use port 80 or 443, and that can typically be configured in
> > a service's config file.
>
> Right, well, port numbers *do* actually have an out of band service
> registry (IANA), which I thought was what we wanted to avoid? ;)

Depends how you think about it ;)

I think we should avoid a global registry. But having a registry per
deployment / server doesn't seem awful. Services that want to use a
field would have a config file setting to set which index it corresponds
to.
Admins would just have to pick a free one on their system, and set it in
the config file of the service.

This is similar to what we do for non-IANA registered ports internally.
For example each service needs a port on an internal interface to expose
metrics, and we just track which ports are taken in our config
management.

Dynamically registering fields means you have to share the returned ID
with whoever is interested, which sounds tricky.
If an XDP program sets a field like packet_id, every tracing
program that looks at it, and userspace service, would need to know what
the ID of that field is.
Is there a way to easily share that ID with all of them?

> > This side-steps the whole question of how to change the registered
> > metadata for in-flight packets, and how to deal with different NICs
> > with different hardware metadata.
> > > > I think I've figured out a suitable encoding format, hopefully we'll > > have an RFC soon! > > Alright, cool! > > -Toke
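To make the semantics of the snippet above concrete, here is a minimal
usage sketch (it assumes the set()/get()/del() helpers and the meta[]
buffer from Arthur's mail are in scope; the key numbers are arbitrary
picks for illustration):

/* Usage sketch only: relies on the set()/get()/del() helpers and the
 * meta[] buffer from the snippet above. Key numbers are invented. */
int main(void) {
    uint64_t packet_id = 0xfeedfacecafef00dull;
    uint32_t mark = 42;
    uint64_t out;

    // Key 9 first, then key 3: the second set() has to memmove
    // key 9's value over to keep entries sorted by key.
    if (set(9, sizeof(packet_id), &packet_id) ||
        set(3, sizeof(mark), &mark))
        return 1;

    if (get(9, sizeof(out), &out))
        return 1;
    printf("packet_id: %llx\n", (unsigned long long)out);

    // Deleting key 3 shifts key 9's value back down.
    del(3);
    return get(9, sizeof(out), &out);
}

Since meta[] is a global here it starts out zeroed, so the empty header
is valid from the start; real packet headroom would still need that
property guaranteed somehow, per the TODO above.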
On 09/26, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> >
> > >> I'm hinting at some complications here (with the EFAULT return) that
> > >> need to be resolved: there is no guarantee that a given packet will be
> > >> in sync with the current status of the registered metadata, so we need
> > >> explicit checks for this. If metadata entries are de-registered again
> > >> this also means dealing with holes and/or reshuffling the metadata
> > >> layout to reuse the released space (incidentally, this is the one place
> > >> where a TLV format would have advantages).
> > >
> > > I like this approach but it seems to me more suitable for 'sw' metadata
> > > (this is mainly Arthur and Jakub's use case iiuc) where the userspace
> > > would enable/disable these functionalities system-wide.
> > > Regarding device hw metadata (e.g. checksum offload) I can see some issues
> > > since on a system we can have multiple NICs with different capabilities.
> > > If we consider the current codebase, the stmmac driver supports only rx
> > > timestamp, while mlx5 supports all of them. In a theoretical system with
> > > these two NICs, since pkt_metadata_registry is global system-wide, we
> > > will end up with quite a lot of holes for the stmmac, right? (I am not
> > > sure if this case is relevant or not). In other words, we will end up
> > > with a fixed struct for device rx hw metadata (like xdp_rx_meta). So I
> > > am wondering if we really need all this complexity for xdp rx hw
> > > metadata?
> >
> > Well, the "holes" will be there anyway (in your static struct approach).
> > They would just correspond to parts of the "struct xdp_rx_meta" being
> > unset.
>
> yes, what I wanted to say is that I have the feeling we will end up with
> the same fields layout 90% of the time, and the cases where we can save
> some space seem very limited. Anyway, I am fine with discussing a common
> architecture.
>
> >
> > What the "userspace can turn off the fields system wide" would
> > accomplish is to *avoid* the holes if you know that you will never need
> > them. E.g., say a system administrator knows that they have no networks
> > that use (offloaded) VLANs. They could then disable the vlan offload
> > field system-wide, and thus reclaim the four bytes taken up by the
> > "vlan" member of struct xdp_rx_meta, freeing that up for use by
> > application metadata.
>
> Even if I like the idea of having a common approach for this kernel feature,
> hw metadata seems to me quite a corner case with respect to 'user-defined
> metadata', since:
> - I do not think it is a common scenario to disable hw offload capabilities
>   (e.g. checksum offload) in production
> - I guess it is not enough to just disable them via bpf; the user/sysadmin
>   will also need to configure the NIC via ethtool (so a 2-step process).
>
> I think we should pay attention not to overcomplicate something that is 99%
> enabled and just needs to be fast. E.g. I can see an issue with putting the
> hw rx metadata in the metadata field, since metadata grows backward and we
> will probably end up putting it in a different cacheline with respect to
> xdp_frame (xdp headroom is usually 256B).
>
> >
> > However, it may well be that the complexity of allowing fields to be
> > turned off is not worth the gains. At least as long as there are only
> > the couple of HW metadata fields that we have currently.
> > Having the
> > flexibility to change our minds later would be good, though, which is
> > mostly a matter of making the API exposed to BPF and/or userspace
> > flexible enough to allow us to move things around in memory in the
> > future. Which was basically my thought with the API I sketched out in
> > the previous email. I.e., you could go:
> >
> > ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
> >                                     &my_hash_value, sizeof(u32))
>
> yes, my plan is to add dedicated bpf kfuncs to store hw metadata in
> xdp_frame/xdp_buff
>
> >
> > ...and the METADATA_ID_HW_HASH would be a constant defined by the
> > kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
> > hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
> > the field in the future, we just change the kfunc implementation, with
> > no impact to the BPF programs calling it.
> >
>
> maybe we can use what Stanislav has already added (maybe removing the xdp
> prefix):
>
> enum xdp_rx_metadata {
>     XDP_METADATA_KFUNC_RX_TIMESTAMP,
>     XDP_METADATA_KFUNC_RX_HASH,
>     XDP_METADATA_KFUNC_RX_VLAN_TAG
> };

I think it was one of the ideas floating around back then for the
xdp->skb case (including redirection): have an extra kfunc that the bpf
program can call and make this kfunc store the metadata (in the metadata
area) in some fixed format. Then the kernel can consume it.
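For concreteness, a rough sketch of what such a kfunc could look like
follows. Everything here is invented for illustration (the struct
layout, the magic value, the flag, and the kfunc name); it is not an
existing kernel API:

/* Hypothetical fixed layout agreed on by the kfunc and the
 * xdp_frame->skb conversion path; not an existing kernel struct. */
struct xdp_generic_meta {
    __u32 magic;    /* lets the conversion path detect the struct */
    __u32 rx_hash;
    __u16 vlan_tci;
    __u16 fields;   /* bitmap of valid fields below */
};

#define XDP_GENERIC_META_MAGIC     0x58444d45  /* invented value */
#define XDP_GENERIC_META_RX_HASH   0x1         /* invented flag */

/* Sketch of a kfunc the XDP program calls to persist the RX hash. */
__bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_buff *xdp, __u32 hash)
{
    struct xdp_generic_meta *meta = xdp->data_meta;

    /* Program must have reserved the space with bpf_xdp_adjust_meta() */
    if (xdp->data - xdp->data_meta < (long)sizeof(*meta))
        return -ENOSPC;

    if (meta->magic != XDP_GENERIC_META_MAGIC) {
        meta->magic = XDP_GENERIC_META_MAGIC;
        meta->fields = 0;
    }
    meta->rx_hash = hash;
    meta->fields |= XDP_GENERIC_META_RX_HASH;
    return 0;
}

On xdp_frame-to-skb conversion the kernel would then check data_meta for
the magic value before trusting the contents.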
"Arthur Fabre" <afabre@cloudflare.com> writes: >> >> The nice thing about an API like this, though, is that it's extensible, >> >> and the kernel itself can be just another consumer of it for the >> >> metadata fields Lorenzo is adding in this series. I.e., we could just >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same >> >> functions as above from within the kernel to set and get those values; >> >> using the registry, there could even be an option to turn those off if >> >> an application wants more space for its own usage. Or, alternatively, we >> >> could keep the kernel-internal IDs hardcoded and always allocated, and >> >> just use the getter/setter functions as the BPF API for accessing them. >> > >> > That's exactly what I'm thinking of too, a simple API like: >> > >> > get(u8 key, u8 len, void *val); >> > set(u8 key, u8 len, void *val); >> > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata. >> > >> > If a NIC doesn't support a certain well-known metadata, the key >> > wouldn't be set, and get() would return ENOENT. >> > >> > I think this also lets us avoid having to "register" keys or bits of >> > metadata with the kernel. >> > We'd reserve some number of keys for hardware metadata. >> >> Right, but how do you allocate space/offset for each key without an >> explicit allocation step? You'd basically have to encode the list of IDs >> in the metadata area itself, which implies a TLV format that you have to >> walk on every access? The registry idea in my example above was >> basically to avoid that... > > I've been playing around with having a small fixed header at the front > of the metadata itself, that lets you access values without walking them > all. > > Still WIP, and maybe this is too restrictive, but it lets you encode 64 > 2, 4, or 8 byte KVs with a single 16 byte header: Ohh, that's clever, I like it! :) It's also extensible in the sense that the internal representation can change without impacting the API, so if we end up needing more bits we can always add those. Maybe it would be a good idea to make the 'key' parameter a larger integer type (function parameters are always 64-bit anyway, so might as well go all the way up to u64)? That way we can use higher values for the kernel-reserved types instead of reserving part of the already-small key space for applications (assuming the kernel-internal data is stored somewhere else, like in a static struct as in Lorenzo's patch)? [...] >> > The remaining keys would be up to users. They'd have to allocate keys >> > to services, and configure services to use those keys. >> > This is similar to the way listening on a certain port works: only one >> > service can use port 80 or 443, and that can typically beconfigured in >> > a service's config file. >> >> Right, well, port numbers *do* actually have an out of band service >> registry (IANA), which I thought was what we wanted to avoid? ;) > > Depends how you think about it ;) > > I think we should avoid a global registry. But having a registry per > deployment / server doesn't seem awful. Services that want to use a > field would have a config file setting to set which index it corresponds > to. > Admins would just have to pick a free one on their system, and set it in > the config file of the service. > > This is similar to what we do for non-IANA registered ports internally. > For example each service needs a port on an internal interface to expose > metrics, and we just track which ports are taken in our config > management. 
Right, this would work, but it assumes that applications are well-behaved and do this correctly. Which they probably do in a centrally-managed system like yours, but for random applications shipped by distros, I'm not sure if it's going to work. In fact, it's more or less the situation we have with skb->mark today, isn't it? I.e., applications can co-exist as long as they don't use the same bits, so they have to coordinate on which bits to use. Sure, with this scheme there will be more total bits to use, which can lessen the pressure somewhat, but the basic problem remains. In other words, I worry that in practice we will end up with another github repository serving as a registry for metadata keys... > Dynamically registering fields means you have to share the returned ID > with whoever is interested, which sounds tricky. > If an XDP program sets a field like packet_id, every tracing > program that looks at it, and userspace service, would need to know what > the ID of that field is. > Is there a way to easily share that ID with all of them? Right, so I'll admit this was one of the handwavy bits of my original proposal, but I don't think it's unsolvable. You could do something like (once, on application initialisation): __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */ bpf_map_update(&shared_application_config, &my_key_index, &my_key); and then just get the key out of that map from all programs that want to use it? We could combine such a registration API with your header format, so that the registration just becomes a way of allocating one of the keys from 0-63 (and the registry just becomes a global copy of the header). This would basically amount to moving the "service config file" into the kernel, since that seems to be the only common denominator we can rely on between BPF applications (as all attempts to write a common daemon for BPF management have shown). -Toke
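The consumer side of that sketch could be as simple as the following BPF
program (the pinned map mirrors the one in the snippet above;
bpf_get_packet_metadata_field() is the hypothetical getter from earlier
in the thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Pinned map the main application fills in at startup with the key it
 * got back from the (hypothetical) bpf_register_metadata_field(). */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} shared_application_config SEC(".maps");

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
    __u32 idx = 0;  /* my_key_index */
    __u64 *metadata_key;
    __u64 packet_id;

    metadata_key = bpf_map_lookup_elem(&shared_application_config, &idx);
    if (!metadata_key)
        return XDP_PASS;

    /* hypothetical getter sketched earlier in the thread */
    if (bpf_get_packet_metadata_field(ctx, *metadata_key,
                                      &packet_id, sizeof(packet_id)))
        return XDP_PASS;

    /* ... use packet_id ... */
    return XDP_PASS;
}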
On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> "Arthur Fabre" <afabre@cloudflare.com> writes:
>
> >> >> The nice thing about an API like this, though, is that it's extensible,
> >> >> and the kernel itself can be just another consumer of it for the
> >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> >> >> functions as above from within the kernel to set and get those values;
> >> >> using the registry, there could even be an option to turn those off if
> >> >> an application wants more space for its own usage. Or, alternatively, we
> >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> >> >> just use the getter/setter functions as the BPF API for accessing them.
> >> >
> >> > That's exactly what I'm thinking of too, a simple API like:
> >> >
> >> > get(u8 key, u8 len, void *val);
> >> > set(u8 key, u8 len, void *val);
> >> >
> >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> >> >
> >> > If a NIC doesn't support a certain well-known metadata, the key
> >> > wouldn't be set, and get() would return ENOENT.
> >> >
> >> > I think this also lets us avoid having to "register" keys or bits of
> >> > metadata with the kernel.
> >> > We'd reserve some number of keys for hardware metadata.
> >>
> >> Right, but how do you allocate space/offset for each key without an
> >> explicit allocation step? You'd basically have to encode the list of IDs
> >> in the metadata area itself, which implies a TLV format that you have to
> >> walk on every access? The registry idea in my example above was
> >> basically to avoid that...
> >
> > I've been playing around with having a small fixed header at the front
> > of the metadata itself, that lets you access values without walking them
> > all.
> >
> > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > 2, 4, or 8 byte KVs with a single 16 byte header:
>
> Ohh, that's clever, I like it! :)
>
> It's also extensible in the sense that the internal representation can
> change without impacting the API, so if we end up needing more bits we
> can always add those.
>
> Maybe it would be a good idea to make the 'key' parameter a larger
> integer type (function parameters are always 64-bit anyway, so might as
> well go all the way up to u64)? That way we can use higher values for
> the kernel-reserved types instead of reserving part of the already-small
> key space for applications (assuming the kernel-internal data is stored
> somewhere else, like in a static struct as in Lorenzo's patch)?

Good idea! That makes it more extensible too if we ever support more
keys or bigger lengths.

I'm not sure where the kernel-reserved types should live. Putting them
in here uses up some of the KV IDs, but it uses less headroom space than
always reserving a static struct for them.
Maybe it doesn't matter much, as long as we use the same API to access
them, we could internally switch between one and the other.

>
> [...]
>
> >> > The remaining keys would be up to users. They'd have to allocate keys
> >> > to services, and configure services to use those keys.
> >> > This is similar to the way listening on a certain port works: only one
> >> > service can use port 80 or 443, and that can typically be configured in
> >> > a service's config file.
> >>
> >> Right, well, port numbers *do* actually have an out of band service
> >> registry (IANA), which I thought was what we wanted to avoid? ;)
> >
> > Depends how you think about it ;)
> >
> > I think we should avoid a global registry. But having a registry per
> > deployment / server doesn't seem awful. Services that want to use a
> > field would have a config file setting to set which index it corresponds
> > to.
> > Admins would just have to pick a free one on their system, and set it in
> > the config file of the service.
> >
> > This is similar to what we do for non-IANA registered ports internally.
> > For example each service needs a port on an internal interface to expose
> > metrics, and we just track which ports are taken in our config
> > management.
>
> Right, this would work, but it assumes that applications are
> well-behaved and do this correctly. Which they probably do in a
> centrally-managed system like yours, but for random applications shipped
> by distros, I'm not sure if it's going to work.
>
> In fact, it's more or less the situation we have with skb->mark today,
> isn't it? I.e., applications can co-exist as long as they don't use the
> same bits, so they have to coordinate on which bits to use. Sure, with
> this scheme there will be more total bits to use, which can lessen the
> pressure somewhat, but the basic problem remains. In other words, I
> worry that in practice we will end up with another github repository
> serving as a registry for metadata keys...

That's true. If applications hardcode keys, we'll be back to having
conflicts. If someone creates a registry on github I'll be very sad.

(Maybe we can make the verifier enforce that the key passed to get() and
set() isn't a constant? - only joking)

Applications don't tend to do this for ports though, I think most can be
configured to listen on any port. Is that just because it's been
instilled as "good practice" over time? Could we try to do the same with
some very stern documentation and examples?

Thinking about it more, my only reluctance for a registration API is how
to communicate the ID back to other consumers (our discussion below).

>
> > Dynamically registering fields means you have to share the returned ID
> > with whoever is interested, which sounds tricky.
> > If an XDP program sets a field like packet_id, every tracing
> > program that looks at it, and userspace service, would need to know what
> > the ID of that field is.
> > Is there a way to easily share that ID with all of them?
>
> Right, so I'll admit this was one of the handwavy bits of my original
> proposal, but I don't think it's unsolvable. You could do something like
> (once, on application initialisation):
>
> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>
> and then just get the key out of that map from all programs that want to
> use it?

Passing it out of band works (whether it's via a pinned map like you
described, or through other means like a unix socket or some other
API), it's just more complicated.

Every consumer also needs to know about that API. That won't work with
standard tools. For example if we set a PACKET_ID KV, maybe we could
give it to pwru so it could track packets using it?
Without registering keys, we could pass it as a cli flag. With
registration, we'd have to have some helper to get the KV ID.

It also introduces ordering dependencies between the services on
startup, e.g. packet tracing hooks could only be attached once our XDP
service has registered a PACKET_ID KV, and they could query its API.
> > We could combine such a registration API with your header format, so > that the registration just becomes a way of allocating one of the keys > from 0-63 (and the registry just becomes a global copy of the header). > This would basically amount to moving the "service config file" into the > kernel, since that seems to be the only common denominator we can rely > on between BPF applications (as all attempts to write a common daemon > for BPF management have shown). That sounds reasonable. And I guess we'd have set() check the global registry to enforce that the key has been registered beforehand? > > -Toke Thanks for all the feedback!
On Sep 27, Arthur Fabre wrote:
> On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> > "Arthur Fabre" <afabre@cloudflare.com> writes:
> >
> > >> >> The nice thing about an API like this, though, is that it's extensible,
> > >> >> and the kernel itself can be just another consumer of it for the
> > >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> > >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> > >> >> functions as above from within the kernel to set and get those values;
> > >> >> using the registry, there could even be an option to turn those off if
> > >> >> an application wants more space for its own usage. Or, alternatively, we
> > >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> > >> >> just use the getter/setter functions as the BPF API for accessing them.
> > >> >
> > >> > That's exactly what I'm thinking of too, a simple API like:
> > >> >
> > >> > get(u8 key, u8 len, void *val);
> > >> > set(u8 key, u8 len, void *val);
> > >> >
> > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> > >> >
> > >> > If a NIC doesn't support a certain well-known metadata, the key
> > >> > wouldn't be set, and get() would return ENOENT.
> > >> >
> > >> > I think this also lets us avoid having to "register" keys or bits of
> > >> > metadata with the kernel.
> > >> > We'd reserve some number of keys for hardware metadata.
> > >>
> > >> Right, but how do you allocate space/offset for each key without an
> > >> explicit allocation step? You'd basically have to encode the list of IDs
> > >> in the metadata area itself, which implies a TLV format that you have to
> > >> walk on every access? The registry idea in my example above was
> > >> basically to avoid that...
> > >
> > > I've been playing around with having a small fixed header at the front
> > > of the metadata itself, that lets you access values without walking them
> > > all.
> > >
> > > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > > 2, 4, or 8 byte KVs with a single 16 byte header:
> >
> > Ohh, that's clever, I like it! :)
> >
> > It's also extensible in the sense that the internal representation can
> > change without impacting the API, so if we end up needing more bits we
> > can always add those.
> >
> > Maybe it would be a good idea to make the 'key' parameter a larger
> > integer type (function parameters are always 64-bit anyway, so might as
> > well go all the way up to u64)? That way we can use higher values for
> > the kernel-reserved types instead of reserving part of the already-small
> > key space for applications (assuming the kernel-internal data is stored
> > somewhere else, like in a static struct as in Lorenzo's patch)?
>
> Good idea! That makes it more extensible too if we ever support more
> keys or bigger lengths.
>
> I'm not sure where the kernel-reserved types should live. Putting them
> in here uses up some of the KV IDs, but it uses less headroom space than
> always reserving a static struct for them.
> Maybe it doesn't matter much, as long as we use the same API to access
> them, we could internally switch between one and the other.
>
> >
> > [...]
> >
> > >> > The remaining keys would be up to users. They'd have to allocate keys
> > >> > to services, and configure services to use those keys.
> > >> > This is similar to the way listening on a certain port works: only one
> > >> > service can use port 80 or 443, and that can typically be configured in
> > >> > a service's config file.
> > >>
> > >> Right, well, port numbers *do* actually have an out of band service
> > >> registry (IANA), which I thought was what we wanted to avoid? ;)
> > >
> > > Depends how you think about it ;)
> > >
> > > I think we should avoid a global registry. But having a registry per
> > > deployment / server doesn't seem awful. Services that want to use a
> > > field would have a config file setting to set which index it corresponds
> > > to.
> > > Admins would just have to pick a free one on their system, and set it in
> > > the config file of the service.
> > >
> > > This is similar to what we do for non-IANA registered ports internally.
> > > For example each service needs a port on an internal interface to expose
> > > metrics, and we just track which ports are taken in our config
> > > management.
> >
> > Right, this would work, but it assumes that applications are
> > well-behaved and do this correctly. Which they probably do in a
> > centrally-managed system like yours, but for random applications shipped
> > by distros, I'm not sure if it's going to work.
> >
> > In fact, it's more or less the situation we have with skb->mark today,
> > isn't it? I.e., applications can co-exist as long as they don't use the
> > same bits, so they have to coordinate on which bits to use. Sure, with
> > this scheme there will be more total bits to use, which can lessen the
> > pressure somewhat, but the basic problem remains. In other words, I
> > worry that in practice we will end up with another github repository
> > serving as a registry for metadata keys...
>
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
>
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
>
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?
>
> Thinking about it more, my only reluctance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
>
> >
> > > Dynamically registering fields means you have to share the returned ID
> > > with whoever is interested, which sounds tricky.
> > > If an XDP program sets a field like packet_id, every tracing
> > > program that looks at it, and userspace service, would need to know what
> > > the ID of that field is.
> > > Is there a way to easily share that ID with all of them?
> >
> > Right, so I'll admit this was one of the handwavy bits of my original
> > proposal, but I don't think it's unsolvable. You could do something like
> > (once, on application initialisation):
> >
> > __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> > bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >
> > and then just get the key out of that map from all programs that want to
> > use it?
>
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
>
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
>
> It also introduces ordering dependencies between the services on
> startup, e.g. packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query its API.
>
> >
> > We could combine such a registration API with your header format, so
> > that the registration just becomes a way of allocating one of the keys
> > from 0-63 (and the registry just becomes a global copy of the header).
> > This would basically amount to moving the "service config file" into the
> > kernel, since that seems to be the only common denominator we can rely
> > on between BPF applications (as all attempts to write a common daemon
> > for BPF management have shown).
>
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?
>
> >
> > -Toke
>
> Thanks for all the feedback!

I like this 'fast' KV approach but I guess we should really evaluate its
impact on performance (especially for xdp) since, based on the kfunc call
order in the ebpf program, we can have one or multiple memmove/memcpy for
each packet, right?

Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
so suitable for nic hw metadata since:
- it grows backward
- it is probably in a different cacheline with respect to xdp_frame
- nic hw metadata will not start at a fixed and immutable address, but it
  depends on the running ebpf program

What about having something like:
- fixed hw nic metadata: just after the xdp_frame struct (or if you want
  at the end of the metadata area :)). Here we can reuse the same KV
  approach if it is fast
- user defined metadata: in the metadata area of the xdp_frame/xdp_buff

Regards,
Lorenzo
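For illustration, the split Lorenzo describes might look roughly like
this (all struct and field choices here are invented; only the placement
is the point):

/* Sketch of the proposed split (illustrative only). The fixed HW
 * metadata lives at a constant offset right after struct xdp_frame,
 * hopefully in the same cachelines, while user-defined KV metadata
 * keeps using the xdp data_meta area. */
struct xdp_rx_meta {
    __u32 hash;
    __u64 timestamp;
    __u16 vlan_tci;
    __u16 valid;                /* bitmap: fields the driver filled in */
};

struct xdp_frame_with_meta {    /* hypothetical container */
    struct xdp_frame frame;
    struct xdp_rx_meta rx_meta; /* fixed offset: no dependency on the
                                 * order of kfunc calls */
};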
"Arthur Fabre" <afabre@cloudflare.com> writes: > On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote: >> "Arthur Fabre" <afabre@cloudflare.com> writes: >> >> >> >> The nice thing about an API like this, though, is that it's extensible, >> >> >> and the kernel itself can be just another consumer of it for the >> >> >> metadata fields Lorenzo is adding in this series. I.e., we could just >> >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same >> >> >> functions as above from within the kernel to set and get those values; >> >> >> using the registry, there could even be an option to turn those off if >> >> >> an application wants more space for its own usage. Or, alternatively, we >> >> >> could keep the kernel-internal IDs hardcoded and always allocated, and >> >> >> just use the getter/setter functions as the BPF API for accessing them. >> >> > >> >> > That's exactly what I'm thinking of too, a simple API like: >> >> > >> >> > get(u8 key, u8 len, void *val); >> >> > set(u8 key, u8 len, void *val); >> >> > >> >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata. >> >> > >> >> > If a NIC doesn't support a certain well-known metadata, the key >> >> > wouldn't be set, and get() would return ENOENT. >> >> > >> >> > I think this also lets us avoid having to "register" keys or bits of >> >> > metadata with the kernel. >> >> > We'd reserve some number of keys for hardware metadata. >> >> >> >> Right, but how do you allocate space/offset for each key without an >> >> explicit allocation step? You'd basically have to encode the list of IDs >> >> in the metadata area itself, which implies a TLV format that you have to >> >> walk on every access? The registry idea in my example above was >> >> basically to avoid that... >> > >> > I've been playing around with having a small fixed header at the front >> > of the metadata itself, that lets you access values without walking them >> > all. >> > >> > Still WIP, and maybe this is too restrictive, but it lets you encode 64 >> > 2, 4, or 8 byte KVs with a single 16 byte header: >> >> Ohh, that's clever, I like it! :) >> >> It's also extensible in the sense that the internal representation can >> change without impacting the API, so if we end up needing more bits we >> can always add those. >> >> Maybe it would be a good idea to make the 'key' parameter a larger >> integer type (function parameters are always 64-bit anyway, so might as >> well go all the way up to u64)? That way we can use higher values for >> the kernel-reserved types instead of reserving part of the already-small >> key space for applications (assuming the kernel-internal data is stored >> somewhere else, like in a static struct as in Lorenzo's patch)? > > Good idea! That makes it more extensible too if we ever support more > keys or bigger lengths. > > I'm not sure where the kernel-reserved types should live. Putting them > in here uses up some the of KV IDs, but it uses less head room space than > always reserving a static struct for them. > Maybe it doesn't matter much, as long as we use the same API to access > them, we could internally switch between one and the other. Yeah, as long as we can move them around, we can do the thing that makes most sense from a performance PoV, and we can change it later if that changes. >> >> [...] >> >> >> > The remaining keys would be up to users. They'd have to allocate keys >> >> > to services, and configure services to use those keys. 
>> >> > This is similar to the way listening on a certain port works: only one
>> >> > service can use port 80 or 443, and that can typically be configured in
>> >> > a service's config file.
>> >>
>> >> Right, well, port numbers *do* actually have an out of band service
>> >> registry (IANA), which I thought was what we wanted to avoid? ;)
>> >
>> > Depends how you think about it ;)
>> >
>> > I think we should avoid a global registry. But having a registry per
>> > deployment / server doesn't seem awful. Services that want to use a
>> > field would have a config file setting to set which index it corresponds
>> > to.
>> > Admins would just have to pick a free one on their system, and set it in
>> > the config file of the service.
>> >
>> > This is similar to what we do for non-IANA registered ports internally.
>> > For example each service needs a port on an internal interface to expose
>> > metrics, and we just track which ports are taken in our config
>> > management.
>>
>> Right, this would work, but it assumes that applications are
>> well-behaved and do this correctly. Which they probably do in a
>> centrally-managed system like yours, but for random applications shipped
>> by distros, I'm not sure if it's going to work.
>>
>> In fact, it's more or less the situation we have with skb->mark today,
>> isn't it? I.e., applications can co-exist as long as they don't use the
>> same bits, so they have to coordinate on which bits to use. Sure, with
>> this scheme there will be more total bits to use, which can lessen the
>> pressure somewhat, but the basic problem remains. In other words, I
>> worry that in practice we will end up with another github repository
>> serving as a registry for metadata keys...
>
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
>
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
>
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?

Well, telling users "you're doing it wrong" hasn't been that successful
in the past, I'm afraid...

> Thinking about it more, my only reluctance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
>
>>
>> > Dynamically registering fields means you have to share the returned ID
>> > with whoever is interested, which sounds tricky.
>> > If an XDP program sets a field like packet_id, every tracing
>> > program that looks at it, and userspace service, would need to know what
>> > the ID of that field is.
>> > Is there a way to easily share that ID with all of them?
>>
>> Right, so I'll admit this was one of the handwavy bits of my original
>> proposal, but I don't think it's unsolvable. You could do something like
>> (once, on application initialisation):
>>
>> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
>> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>>
>> and then just get the key out of that map from all programs that want to
>> use it?
>
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
>
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
>
> It also introduces ordering dependencies between the services on
> startup, e.g. packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query its API.

Yeah, we definitely need a way to make that accessible and not too
cumbersome.

I suppose what we really need is a way to map an application-specific
identifier to an ID. And, well, those identifiers could just be (string)
names? That's what we do for CO-RE, after all. So you'd get something
like:

id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */

id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */

and we make that idempotent, so that two callers using the same name and
size will just get the same id back; and if called with BPF_NO_CREATE,
you'll get -ENOENT if it hasn't already been registered by someone else?

We could even make this a sub-command of the bpf() syscall if we want it
to be UAPI, but that's not strictly necessary, applications can also
just call the registration from a syscall program at startup...

>> We could combine such a registration API with your header format, so
>> that the registration just becomes a way of allocating one of the keys
>> from 0-63 (and the registry just becomes a global copy of the header).
>> This would basically amount to moving the "service config file" into the
>> kernel, since that seems to be the only common denominator we can rely
>> on between BPF applications (as all attempts to write a common daemon
>> for BPF management have shown).
>
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?

Yes, exactly. And maybe check that the size matches as well just to
remove the obvious footgun of accidentally stepping on each other's
toes?

> Thanks for all the feedback!

You're welcome! Thanks for working on this :)

-Toke
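A minimal userspace model of that idempotent registration could look
like this (purely illustrative: a kernel version would need locking, a
proper name hash, and some story for de-registration):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct meta_field {
    char name[32];
    unsigned char size;
};

static struct meta_field registry[64]; /* slot index == KV key 0-63 */
static int num_fields;

/* Returns the key for `name`, registering it first if create is set.
 * Idempotent: the same name and size always yield the same key. */
int bpf_register_metadata_field(const char *name, unsigned char size,
                                bool create)
{
    int i;

    for (i = 0; i < num_fields; i++) {
        /* same name: hand back the existing key, or flag a size clash */
        if (!strcmp(registry[i].name, name))
            return registry[i].size == size ? i : -EINVAL;
    }
    if (!create)
        return -ENOENT;
    if (num_fields >= 64)
        return -ENOSPC;

    snprintf(registry[num_fields].name, sizeof(registry[0].name),
             "%s", name);
    registry[num_fields].size = size;
    return num_fields++;
}

int main(void) {
    int a = bpf_register_metadata_field("packet_id", 8, true);
    int b = bpf_register_metadata_field("packet_id", 8, false);
    printf("%d == %d\n", a, b); /* same key both times */
    return 0;
}

(#include <string.h> is also needed for strcmp(); strings stand in for
the BPF_CREATE/BPF_NO_CREATE flags from the sketch above.)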
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> > We could combine such a registration API with your header format, so
>> > that the registration just becomes a way of allocating one of the keys
>> > from 0-63 (and the registry just becomes a global copy of the header).
>> > This would basically amount to moving the "service config file" into the
>> > kernel, since that seems to be the only common denominator we can rely
>> > on between BPF applications (as all attempts to write a common daemon
>> > for BPF management have shown).
>>
>> That sounds reasonable. And I guess we'd have set() check the global
>> registry to enforce that the key has been registered beforehand?
>>
>> >
>> > -Toke
>>
>> Thanks for all the feedback!
>
> I like this 'fast' KV approach but I guess we should really evaluate its
> impact on performance (especially for xdp) since, based on the kfunc call
> order in the ebpf program, we can have one or multiple memmove/memcpy for
> each packet, right?

Yes, with Arthur's scheme, performance will be ordering dependent. Using
a global registry for offsets would sidestep this, but have the
synchronisation issues we discussed up-thread. So on balance, I think
the memmove() suggestion will probably lead to the least pain.

For the HW metadata we could sidestep this by always having a fixed
struct for it (but using the same set/get() API with reserved keys). The
only drawback of doing that is that we statically reserve a bit of
space, but I'm not sure that is such a big issue in practice (at least
not until this becomes so popular that the space starts to be contended;
but surely 256 bytes ought to be enough for everybody, right? :)).

> Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> so suitable for nic hw metadata since:
> - it grows backward
> - it is probably in a different cacheline with respect to xdp_frame
> - nic hw metadata will not start at a fixed and immutable address, but it
>   depends on the running ebpf program
>
> What about having something like:
> - fixed hw nic metadata: just after the xdp_frame struct (or if you want
>   at the end of the metadata area :)). Here we can reuse the same KV
>   approach if it is fast
> - user defined metadata: in the metadata area of the xdp_frame/xdp_buff

AFAIU, none of this will live in the (current) XDP metadata area. It
will all live just after the xdp_frame struct (so sharing the space with
the metadata area in the sense that adding more metadata kv fields will
decrease the amount of space that is usable by the current XDP metadata
APIs).

-Toke
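Sketching the fixed-struct-with-reserved-keys idea: the same get() entry
point could dispatch reserved keys to a fixed struct stored after
xdp_frame and fall through to the dynamic KV area for everything else.
All names here, and the kv_get() helper, are invented for illustration:

/* Illustrative dispatch only; not an existing kernel API. */
enum {
    META_KEY_HW_HASH  = 0,
    META_KEY_HW_VLAN  = 1,
    META_KEY_USER_BASE = 8, /* first non-reserved key */
};

struct hw_meta {    /* fixed, right after struct xdp_frame */
    __u32 hash;
    __u16 vlan_tci;
    __u16 valid;    /* bitmap of fields the driver filled in */
};

int meta_get(struct xdp_frame *frm, __u64 key, void *val, __u8 len)
{
    struct hw_meta *hw = (struct hw_meta *)(frm + 1);

    switch (key) {
    case META_KEY_HW_HASH:
        if (len != sizeof(hw->hash) || !(hw->valid & 0x1))
            return -ENOENT;
        memcpy(val, &hw->hash, len);
        return 0;
    case META_KEY_HW_VLAN:
        if (len != sizeof(hw->vlan_tci) || !(hw->valid & 0x2))
            return -ENOENT;
        memcpy(val, &hw->vlan_tci, len);
        return 0;
    default:
        /* hypothetical getter for the dynamic KV area */
        return kv_get(frm, key, val, len);
    }
}

Callers see one uniform API either way, which is what makes it possible
to move fields between the fixed struct and the KV area later.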
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> >> > We could combine such a registration API with your header format, so
> >> > that the registration just becomes a way of allocating one of the keys
> >> > from 0-63 (and the registry just becomes a global copy of the header).
> >> > This would basically amount to moving the "service config file" into the
> >> > kernel, since that seems to be the only common denominator we can rely
> >> > on between BPF applications (as all attempts to write a common daemon
> >> > for BPF management have shown).
> >>
> >> That sounds reasonable. And I guess we'd have set() check the global
> >> registry to enforce that the key has been registered beforehand?
> >>
> >> >
> >> > -Toke
> >>
> >> Thanks for all the feedback!
> >
> > I like this 'fast' KV approach but I guess we should really evaluate its
> > impact on performance (especially for xdp) since, based on the kfunc call
> > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > each packet, right?
>
> Yes, with Arthur's scheme, performance will be ordering dependent. Using
> a global registry for offsets would sidestep this, but have the
> synchronisation issues we discussed up-thread. So on balance, I think
> the memmove() suggestion will probably lead to the least pain.
>
> For the HW metadata we could sidestep this by always having a fixed
> struct for it (but using the same set/get() API with reserved keys). The
> only drawback of doing that is that we statically reserve a bit of
> space, but I'm not sure that is such a big issue in practice (at least
> not until this becomes so popular that the space starts to be contended;
> but surely 256 bytes ought to be enough for everybody, right? :)).

I am fine with the proposed approach, but I think we need to verify the
impact on performance (in the worst case?)

> > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > so suitable for nic hw metadata since:
> > - it grows backward
> > - it is probably in a different cacheline with respect to xdp_frame
> > - nic hw metadata will not start at a fixed and immutable address, but it
> >   depends on the running ebpf program
> >
> > What about having something like:
> > - fixed hw nic metadata: just after the xdp_frame struct (or if you want
> >   at the end of the metadata area :)). Here we can reuse the same KV
> >   approach if it is fast
> > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
>
> AFAIU, none of this will live in the (current) XDP metadata area. It
> will all live just after the xdp_frame struct (so sharing the space with
> the metadata area in the sense that adding more metadata kv fields will
> decrease the amount of space that is usable by the current XDP metadata
> APIs).
>
> -Toke
>

ah, ok. I was thinking the proposed approach was to put them in the current
metadata field.

Regards,
Lorenzo
On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
> > Thinking about it more, my only reluctance for a registration API is how
> > to communicate the ID back to other consumers (our discussion below).
> >
> >>
> >> > Dynamically registering fields means you have to share the returned ID
> >> > with whoever is interested, which sounds tricky.
> >> > If an XDP program sets a field like packet_id, every tracing
> >> > program that looks at it, and userspace service, would need to know what
> >> > the ID of that field is.
> >> > Is there a way to easily share that ID with all of them?
> >>
> >> Right, so I'll admit this was one of the handwavy bits of my original
> >> proposal, but I don't think it's unsolvable. You could do something like
> >> (once, on application initialisation):
> >>
> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >>
> >> and then just get the key out of that map from all programs that want to
> >> use it?
> >
> > Passing it out of band works (whether it's via a pinned map like you
> > described, or through other means like a unix socket or some other
> > API), it's just more complicated.
> >
> > Every consumer also needs to know about that API. That won't work with
> > standard tools. For example if we set a PACKET_ID KV, maybe we could
> > give it to pwru so it could track packets using it?
> > Without registering keys, we could pass it as a cli flag. With
> > registration, we'd have to have some helper to get the KV ID.
> >
> > It also introduces ordering dependencies between the services on
> > startup, e.g. packet tracing hooks could only be attached once our XDP
> > service has registered a PACKET_ID KV, and they could query its API.
>
> Yeah, we definitely need a way to make that accessible and not too
> cumbersome.
>
> I suppose what we really need is a way to map an application-specific
> identifier to an ID. And, well, those identifiers could just be (string)
> names? That's what we do for CO-RE, after all. So you'd get something
> like:
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
>
> and we make that idempotent, so that two callers using the same name and
> size will just get the same id back; and if called with BPF_NO_CREATE,
> you'll get -ENOENT if it hasn't already been registered by someone else?
>
> We could even make this a sub-command of the bpf() syscall if we want it
> to be UAPI, but that's not strictly necessary, applications can also
> just call the registration from a syscall program at startup...

That's a nice API, it makes sharing the IDs much easier.

We still have to worry about collisions (what if two different things
want to add their own "packet_id" field?). But at least:

* "Any string" has many more possibilities than the 0-63 key space.

* bpf_register_metadata_field() could return an error if a field is
already registered, instead of silently letting an application overwrite
metadata (although arguably we could add a BPF_NOEXIST style flag to the
KV set() to kind of do the same).

At least internally, it still feels like we'd maintain a registry of
these string fields and make them configurable for each service to avoid
collisions.
> >> We could combine such a registration API with your header format, so > >> that the registration just becomes a way of allocating one of the keys > >> from 0-63 (and the registry just becomes a global copy of the header). > >> This would basically amount to moving the "service config file" into the > >> kernel, since that seems to be the only common denominator we can rely > >> on between BPF applications (as all attempts to write a common daemon > >> for BPF management have shown). > > > > That sounds reasonable. And I guess we'd have set() check the global > > registry to enforce that the key has been registered beforehand? > > Yes, exactly. And maybe check that the size matches as well just to > remove the obvious footgun of accidentally stepping on each other's > toes? > > > Thanks for all the feedback! > > You're welcome! Thanks for working on this :) > > -Toke
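Put together, the checked setter agreed on here could be as small as the
following sketch (registry_lookup() and kv_set() are hypothetical
helpers, and the error codes are just one possible choice):

/* Sketch: set() consults the global registry first, catching both
 * unregistered keys and the size-mismatch footgun mentioned above. */
int checked_set(__u64 key, __u8 len, void *val)
{
    const struct meta_field *f = registry_lookup(key); /* hypothetical */

    if (!f)
        return -ENOENT; /* key was never registered */
    if (f->size != len)
        return -EINVAL; /* registered with a different size */

    return kv_set(key, len, val); /* hypothetical raw setter */
}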
On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >
> > >> > We could combine such a registration API with your header format, so
> > >> > that the registration just becomes a way of allocating one of the keys
> > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > >> > This would basically amount to moving the "service config file" into the
> > >> > kernel, since that seems to be the only common denominator we can rely
> > >> > on between BPF applications (as all attempts to write a common daemon
> > >> > for BPF management have shown).
> > >>
> > >> That sounds reasonable. And I guess we'd have set() check the global
> > >> registry to enforce that the key has been registered beforehand?
> > >>
> > >> >
> > >> > -Toke
> > >>
> > >> Thanks for all the feedback!
> > >
> > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > impact on performance (especially for xdp) since, based on the kfunc call
> > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > each packet, right?
> >
> > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > a global registry for offsets would sidestep this, but have the
> > synchronisation issues we discussed up-thread. So on balance, I think
> > the memmove() suggestion will probably lead to the least pain.
> >
> > For the HW metadata we could sidestep this by always having a fixed
> > struct for it (but using the same set/get() API with reserved keys). The
> > only drawback of doing that is that we statically reserve a bit of
> > space, but I'm not sure that is such a big issue in practice (at least
> > not until this becomes so popular that the space starts to be contended;
> > but surely 256 bytes ought to be enough for everybody, right? :)).
>
> I am fine with the proposed approach, but I think we need to verify the
> impact on performance (in the worst case?)

If drivers are responsible for populating the hardware metadata before
XDP, we could make sure drivers set the fields in order to avoid any
memmove() (and maybe even provide a helper to ensure this?).

> > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > so suitable for nic hw metadata since:
> > > - it grows backward
> > > - it is probably in a different cacheline with respect to xdp_frame
> > > - nic hw metadata will not start at a fixed and immutable address, but it
> > >   depends on the running ebpf program
> > >
> > > What about having something like:
> > > - fixed hw nic metadata: just after the xdp_frame struct (or if you want
> > >   at the end of the metadata area :)). Here we can reuse the same KV
> > >   approach if it is fast
> > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> >
> > AFAIU, none of this will live in the (current) XDP metadata area. It
> > will all live just after the xdp_frame struct (so sharing the space with
> > the metadata area in the sense that adding more metadata kv fields will
> > decrease the amount of space that is usable by the current XDP metadata
> > APIs).
> >
> > -Toke
> >
>
> ah, ok. I was thinking the proposed approach was to put them in the current
> metadata field.

I've also been thinking of putting this new KV stuff at the start of the
headroom (I think that's what you're saying Toke?). It has a few nice
advantages:

* It coexists nicely with the current XDP / TC metadata support.
  Those users won't be able to overwrite / corrupt the KV metadata.
  KV users won't need to call xdp_adjust_meta() (which would be awkward -
  how would they know how much space the KV implementation needs).

* We don't have to move all the metadata every time we call
  xdp_adjust_head() (or the kernel equivalent).

Are there any performance implications of that, e.g. for caching?

This would also grow "upwards", which is more natural, but I think
either way the KV API would hide whether it's downwards or upwards from
users.

>
> Regards,
> Lorenzo
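The "set the fields in order" helper mentioned above could be a thin
wrapper along these lines (sketch only; it reuses the set() helper from
Arthur's earlier snippet, and the per-packet reset of the state is
hand-waved):

/* Sketch: enforce ascending key order so every insertion is an append
 * and no previously written entries ever need to be moved. */
static int last_key_set = -1; /* would be reset for each packet */

int set_in_order(uint8_t key, uint8_t len, void *val)
{
    if ((int)key <= last_key_set)
        return -EINVAL; /* out of order: would shift existing entries */
    last_key_set = key;
    return set(key, len, val);
}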
> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > >
> > > >> > We could combine such a registration API with your header format, so
> > > >> > that the registration just becomes a way of allocating one of the keys
> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > >> > This would basically amount to moving the "service config file" into the
> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > >> > for BPF management have shown).
> > > >>
> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > >> registry to enforce that the key has been registered beforehand?
> > > >>
> > > >> >
> > > >> > -Toke
> > > >>
> > > >> Thanks for all the feedback!
> > > >
> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > impact on performance (especially for xdp) since, based on the kfunc call
> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > each packet, right?
> > >
> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > a global registry for offsets would sidestep this, but have the
> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > the memmove() suggestion will probably lead to the least pain.
> > >
> > > For the HW metadata we could sidestep this by always having a fixed
> > > struct for it (but using the same set/get() API with reserved keys). The
> > > only drawback of doing that is that we statically reserve a bit of
> > > space, but I'm not sure that is such a big issue in practice (at least
> > > not until this becomes so popular that the space starts to be contended;
> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >
> > I am fine with the proposed approach, but I think we need to verify the
> > impact on performance (in the worst case?)
>
> If drivers are responsible for populating the hardware metadata before
> XDP, we could make sure drivers set the fields in order to avoid any
> memmove() (and maybe even provide a helper to ensure this?).

nope, since the current APIs introduced by Stanislav are consuming NIC
metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
we want to add a kfunc to store the info for each NIC metadata (e.g.
rx-hash, timestamping, ...) into the packet (this is what Toke is
proposing, right?). In this case the kfunc calling order makes a
difference.
We could even add a single kfunc to store the info for all the NIC
metadata (maybe via a helper struct), but that seems not scalable to me
and we would lose kfunc versatility.

Regards,
Lorenzo

> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > > so suitable for nic hw metadata since:
> > > > - it grows backward
> > > > - it is probably in a different cacheline with respect to xdp_frame
> > > > - nic hw metadata will not start at a fixed and immutable address, but it
> > > >   depends on the running ebpf program
> > > >
> > > > What about having something like:
> > > > - fixed hw nic metadata: just after the xdp_frame struct (or if you want
> > > >   at the end of the metadata area :)). Here we can reuse the same KV
Here he can reuse the same KV approach if it is fast > > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff > > > > > > AFAIU, none of this will live in the (current) XDP metadata area. It > > > will all live just after the xdp_frame struct (so sharing the space with > > > the metadata area in the sense that adding more metadata kv fields will > > > decrease the amount of space that is usable by the current XDP metadata > > > APIs). > > > > > > -Toke > > > > > > > ah, ok. I was thinking the proposed approach was to put them in the current > > metadata field. > > I've also been thinking of putting this new KV stuff at the start of the > headroom (I think that's what you're saying Toke?). It has a few nice > advantanges: > > * It coexists nicely with the current XDP / TC metadata support. > Those users won't be able to overwrite / corrupt the KV metadata. > KV users won't need to call xdp_adjust_meta() (which would be awkward - > how would they know how much space the KV implementation needs). > > * We don't have to move all the metadata everytime we call > xdp_adjust_head() (or the kernel equivalent). > > Are there any performance implications of that, e.g. for caching? > > This would also grow "upwards" which is more natural, but I think > either way the KV API would hide whether it's downwards or upwards from > users. > > > > > Regards, > > Lorenzo >
Lorenzo Bianconi <lorenzo@kernel.org> writes: >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: >> > > >> > > >> > We could combine such a registration API with your header format, so >> > > >> > that the registration just becomes a way of allocating one of the keys >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). >> > > >> > This would basically amount to moving the "service config file" into the >> > > >> > kernel, since that seems to be the only common denominator we can rely >> > > >> > on between BPF applications (as all attempts to write a common daemon >> > > >> > for BPF management have shown). >> > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global >> > > >> registry to enforce that the key has been registered beforehand? >> > > >> >> > > >> > >> > > >> > -Toke >> > > >> >> > > >> Thanks for all the feedback! >> > > > >> > > > I like this 'fast' KV approach but I guess we should really evaluate its >> > > > impact on performances (especially for xdp) since, based on the kfunc calls >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for >> > > > each packet, right? >> > > >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using >> > > a global registry for offsets would sidestep this, but have the >> > > synchronisation issues we discussed up-thread. So on balance, I think >> > > the memmove() suggestion will probably lead to the least pain. >> > > >> > > For the HW metadata we could sidestep this by always having a fixed >> > > struct for it (but using the same set/get() API with reserved keys). The >> > > only drawback of doing that is that we statically reserve a bit of >> > > space, but I'm not sure that is such a big issue in practice (at least >> > > not until this becomes to popular that the space starts to be contended; >> > > but surely 256 bytes ought to be enough for everybody, right? :)). >> > >> > I am fine with the proposed approach, but I think we need to verify what is the >> > impact on performances (in the worst case??) >> >> If drivers are responsible for populating the hardware metadata before >> XDP, we could make sure drivers set the fields in order to avoid any >> memove() (and maybe even provide a helper to ensure this?). > > nope, since the current APIs introduced by Stanislav are consuming NIC > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > timestamping, ..) into the packet (this is what Toke is proposing, right?). > In this case kfunc calling order makes a difference. > We can think even to add single kfunc to store all the info for all the NIC > metadata (maybe via a helping struct) but it seems not scalable to me and we > are losing kfunc versatility. Yes, I agree we should have separate kfuncs for each metadata field. Which means it makes a lot of sense to just use the same setter API that we use for the user-registered metadata fields, but using reserved keys. 
So something like:

#define BPF_METADATA_HW_HASH		BIT(60)
#define BPF_METADATA_HW_TIMESTAMP	BIT(61)
#define BPF_METADATA_HW_VLAN		BIT(62)
#define BPF_METADATA_RESERVED		(0xffffULL << 48)

bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);

As for the internal representation, we can just have the kfunc do
something like:

int bpf_packet_metadata_set(field_id, value) {
	switch (field_id) {
	case BPF_METADATA_HW_HASH:
		pkt->xdp_hw_meta.hash = value;
		break;
	[...]
	default:
		/* do the key packing thing */
	}
}

that way the order of setting the HW fields doesn't matter, only the
user-defined metadata.

>> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
>> > > > so suitable for nic hw metadata since:
>> > > > - it grows backward
>> > > > - it is probably in a different cacheline with respect to xdp_frame
>> > > > - nic hw metadata will not start at fixed and immutable address, but it depends
>> > > > on the running ebpf program
>> > > >
>> > > > What about having something like:
>> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
>> > > > of the metadata area :)). Here he can reuse the same KV approach if it is fast
>> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
>> > >
>> > > AFAIU, none of this will live in the (current) XDP metadata area. It
>> > > will all live just after the xdp_frame struct (so sharing the space with
>> > > the metadata area in the sense that adding more metadata kv fields will
>> > > decrease the amount of space that is usable by the current XDP metadata
>> > > APIs).
>> > >
>> > > -Toke
>> > >
>> >
>> > ah, ok. I was thinking the proposed approach was to put them in the current
>> > metadata field.
>>
>> I've also been thinking of putting this new KV stuff at the start of the
>> headroom (I think that's what you're saying Toke?). It has a few nice
>> advantanges:
>>
>> * It coexists nicely with the current XDP / TC metadata support.
>> Those users won't be able to overwrite / corrupt the KV metadata.
>> KV users won't need to call xdp_adjust_meta() (which would be awkward -
>> how would they know how much space the KV implementation needs).

Yes, that was what I was saying; we need this to co-exist with the
existing xdp_adjust_meta() facility, and moving it back and forth to
achieve that seems like a non-starter. So definitely at the start of
the headroom (after xdp_frame).

>> * We don't have to move all the metadata everytime we call
>> xdp_adjust_head() (or the kernel equivalent).
>>
>> Are there any performance implications of that, e.g. for caching?

Well, putting it at the beginning means that the HW metadata (assuming
that comes first) will be on the same cache line as the xdp_frame
struct itself (and thus should be cache-hot). For user-defined metadata
it will depend on the size, of course; it will probably end up spilling
into the next cache line (which will affect performance), but I don't
think that can be helped...

-Toke
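[Spelled out as a compilable sketch of the pseudocode above. struct
pkt_meta, kv_set() and the exact field layout are assumptions for
illustration, not part of the proposal; note the reserved mask needs a
ULL suffix so the shift doesn't overflow int:

#include <linux/bits.h>
#include <linux/errno.h>
#include <linux/types.h>

#define BPF_METADATA_HW_HASH		BIT_ULL(60)
#define BPF_METADATA_HW_TIMESTAMP	BIT_ULL(61)
#define BPF_METADATA_HW_VLAN		BIT_ULL(62)
#define BPF_METADATA_RESERVED		(0xffffULL << 48)

/* Hypothetical per-packet area: a fixed struct for the hw hints,
 * plus a compact KV region for user-defined keys. */
struct pkt_meta {
	struct {
		u32 hash;
		u64 timestamp;
		u16 vlan_tci;
	} hw;
	u8 kv[];	/* compact user KV area */
};

int kv_set(struct pkt_meta *meta, u64 key, u64 value); /* packed insert */

int bpf_packet_metadata_set(struct pkt_meta *meta, u64 field_id, u64 value)
{
	switch (field_id) {
	case BPF_METADATA_HW_HASH:
		meta->hw.hash = value;
		return 0;
	case BPF_METADATA_HW_TIMESTAMP:
		meta->hw.timestamp = value;
		return 0;
	case BPF_METADATA_HW_VLAN:
		meta->hw.vlan_tci = value;
		return 0;
	default:
		if (field_id & BPF_METADATA_RESERVED)
			return -EINVAL;	/* unassigned reserved key */
		return kv_set(meta, field_id, value); /* user key */
	}
}

Because the hw fields land in fixed struct slots, only the kv_set()
fallback can ever trigger the memmove() discussed up-thread.]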
"Arthur Fabre" <afabre@cloudflare.com> writes: > On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote: >> > Thinking about it more, my only relectance for a registration API is how >> > to communicate the ID back to other consumers (our discussion below). >> > >> >> >> >> > Dynamically registering fields means you have to share the returned ID >> >> > with whoever is interested, which sounds tricky. >> >> > If an XDP program sets a field like packet_id, every tracing >> >> > program that looks at it, and userspace service, would need to know what >> >> > the ID of that field is. >> >> > Is there a way to easily share that ID with all of them? >> >> >> >> Right, so I'll admit this was one of the handwavy bits of my original >> >> proposal, but I don't think it's unsolvable. You could do something like >> >> (once, on application initialisation): >> >> >> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */ >> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key); >> >> >> >> and then just get the key out of that map from all programs that want to >> >> use it? >> > >> > Passing it out of band works (whether it's via a pinned map like you >> > described, or through other means like a unix socket or some other >> > API), it's just more complicated. >> > >> > Every consumer also needs to know about that API. That won't work with >> > standard tools. For example if we set a PACKET_ID KV, maybe we could >> > give it to pwru so it could track packets using it? >> > Without registering keys, we could pass it as a cli flag. With >> > registration, we'd have to have some helper to get the KV ID. >> > >> > It also introduces ordering dependencies between the services on >> > startup, eg packet tracing hooks could only be attached once our XDP >> > service has registered a PACKET_ID KV, and they could query it's API. >> >> Yeah, we definitely need a way to make that accessible and not too >> cumbersome. >> >> I suppose what we really need is a way to map an application-specific >> identifier to an ID. And, well, those identifiers could just be (string) >> names? That's what we do for CO-RE, after all. So you'd get something >> like: >> >> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */ >> >> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */ >> >> and we make that idempotent, so that two callers using the same name and >> size will just get the same id back; and if called with BPF_NO_CREATE, >> you'll get -ENOENT if it hasn't already been registered by someone else? >> >> We could even make this a sub-command of the bpf() syscall if we want it >> to be UAPI, but that's not strictly necessary, applications can also >> just call the registration from a syscall program at startup... > > That's a nice API, it makes sharing the IDs much easier. > > We still have to worry about collisions (what if two different things > want to add their own "packet_id" field?). But at least: > > * "Any string" has many more possibilities than 0-64 keys. Yes, and it's easy to namespace (by prefixing, like APPNAME_my_metaname). But sure, if everyone uses very generic names that will be a problem. > * bpf_register_metadata() could return an error if a field is already > registered, instead of silently letting an application overwrite > metadata I don't think we can fundamentally solve the collision problem if we also need to allow sharing keys (on purpose). 
I.e., we can't distinguish between "these two applications deliberately
want to share the packet_id field" and "these two applications
accidentally both picked packet_id as their internal key".

I suppose we could pre-define some extra reserved keys if there turn
out to be widely used identifiers that many applications want. But I'm
not sure if that should be there from the start; it quickly becomes very
speculative (packet_id comes to mind as one that might be generally
useful, but I'm really not sure, TBH).

> (although arguably we could have add a BPF_NOEXIST style flag
> to the KV set() to kind of do the same).

A global registry will need locking, so not sure it's a good idea to
support inserting into it in the fast path?

> At least internally, it still feels like we'd maintain a registry of
> these string fields and make them configurable for each service to avoid
> collisions.

Yeah, see above. Some kind of coordination (like a registry) is
impossible to avoid if you *want* to share data, but not sure how
common that is?

-Toke
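[To pin down the registration semantics discussed above, a toy model in
plain C. The flag names follow the mail; the fixed 0-63 key space, the
32-byte name limit and the lack of locking are simplifying assumptions,
not a proposed kernel implementation:

#include <errno.h>
#include <string.h>

#define BPF_CREATE	1
#define BPF_NO_CREATE	0

struct meta_field {
	char name[32];
	unsigned int size;
};

static struct meta_field registry[64];	/* one slot per key, 0-63 */

/* Idempotent: the same name (and size) always resolves to the same
 * key, so independent programs agree without out-of-band sharing. */
int bpf_register_metadata_field(const char *name, unsigned int size,
				int flags)
{
	int i, free_slot = -1;

	for (i = 0; i < 64; i++) {
		if (!registry[i].name[0]) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (!strcmp(registry[i].name, name))
			return registry[i].size == size ? i : -EINVAL;
	}
	if (flags == BPF_NO_CREATE)
		return -ENOENT;		/* resolve-only lookup failed */
	if (free_slot < 0)
		return -ENOSPC;		/* all 64 keys taken */
	strncpy(registry[free_slot].name, name,
		sizeof(registry[0].name) - 1);
	registry[free_slot].size = size;
	return free_slot;
}

Usage would then be, e.g. from a syscall program at startup:
int id = bpf_register_metadata_field("APPNAME_packet_id", 8, BPF_CREATE);]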
On 10/01, Toke Høiland-Jørgensen wrote: > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> > > > >> > > >> > We could combine such a registration API with your header format, so > >> > > >> > that the registration just becomes a way of allocating one of the keys > >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > >> > > >> > This would basically amount to moving the "service config file" into the > >> > > >> > kernel, since that seems to be the only common denominator we can rely > >> > > >> > on between BPF applications (as all attempts to write a common daemon > >> > > >> > for BPF management have shown). > >> > > >> > >> > > >> That sounds reasonable. And I guess we'd have set() check the global > >> > > >> registry to enforce that the key has been registered beforehand? > >> > > >> > >> > > >> > > >> > > >> > -Toke > >> > > >> > >> > > >> Thanks for all the feedback! > >> > > > > >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > >> > > > each packet, right? > >> > > > >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > >> > > a global registry for offsets would sidestep this, but have the > >> > > synchronisation issues we discussed up-thread. So on balance, I think > >> > > the memmove() suggestion will probably lead to the least pain. > >> > > > >> > > For the HW metadata we could sidestep this by always having a fixed > >> > > struct for it (but using the same set/get() API with reserved keys). The > >> > > only drawback of doing that is that we statically reserve a bit of > >> > > space, but I'm not sure that is such a big issue in practice (at least > >> > > not until this becomes to popular that the space starts to be contended; > >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > >> > > >> > I am fine with the proposed approach, but I think we need to verify what is the > >> > impact on performances (in the worst case??) > >> > >> If drivers are responsible for populating the hardware metadata before > >> XDP, we could make sure drivers set the fields in order to avoid any > >> memove() (and maybe even provide a helper to ensure this?). > > > > nope, since the current APIs introduced by Stanislav are consuming NIC > > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > > timestamping, ..) into the packet (this is what Toke is proposing, right?). > > In this case kfunc calling order makes a difference. > > We can think even to add single kfunc to store all the info for all the NIC > > metadata (maybe via a helping struct) but it seems not scalable to me and we > > are losing kfunc versatility. > > Yes, I agree we should have separate kfuncs for each metadata field. > Which means it makes a lot of sense to just use the same setter API that > we use for the user-registered metadata fields, but using reserved keys. 
> So something like:
>
> #define BPF_METADATA_HW_HASH BIT(60)
> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> #define BPF_METADATA_HW_VLAN BIT(62)
> #define BPF_METADATA_RESERVED (0xffff << 48)
>
> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
>
> As for the internal representation, we can just have the kfunc do
> something like:
>
> int bpf_packet_metadata_set(field_id, value) {
> 	switch(field_id) {
> 	case BPF_METADATA_HW_HASH:
> 		pkt->xdp_hw_meta.hash = value;
> 		break;
> 	[...]
> 	default:
> 		/* do the key packing thing */
> 	}
> }
>
> that way the order of setting the HW fields doesn't matter, only the
> user-defined metadata.

Can you expand on why we need the flexibility of picking the metadata
fields here? Presumably we are talking about the use-cases where the
XDP program is doing redirect/pass and it doesn't really know who the
final consumer is (might be another xdp program or might be the
xdp->skb kernel case), so the only sensible option here seems to be to
store everything?
Stanislav Fomichev <stfomichev@gmail.com> writes: > On 10/01, Toke Høiland-Jørgensen wrote: >> Lorenzo Bianconi <lorenzo@kernel.org> writes: >> >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: >> >> > > >> >> > > >> > We could combine such a registration API with your header format, so >> >> > > >> > that the registration just becomes a way of allocating one of the keys >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). >> >> > > >> > This would basically amount to moving the "service config file" into the >> >> > > >> > kernel, since that seems to be the only common denominator we can rely >> >> > > >> > on between BPF applications (as all attempts to write a common daemon >> >> > > >> > for BPF management have shown). >> >> > > >> >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global >> >> > > >> registry to enforce that the key has been registered beforehand? >> >> > > >> >> >> > > >> > >> >> > > >> > -Toke >> >> > > >> >> >> > > >> Thanks for all the feedback! >> >> > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for >> >> > > > each packet, right? >> >> > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using >> >> > > a global registry for offsets would sidestep this, but have the >> >> > > synchronisation issues we discussed up-thread. So on balance, I think >> >> > > the memmove() suggestion will probably lead to the least pain. >> >> > > >> >> > > For the HW metadata we could sidestep this by always having a fixed >> >> > > struct for it (but using the same set/get() API with reserved keys). The >> >> > > only drawback of doing that is that we statically reserve a bit of >> >> > > space, but I'm not sure that is such a big issue in practice (at least >> >> > > not until this becomes to popular that the space starts to be contended; >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). >> >> > >> >> > I am fine with the proposed approach, but I think we need to verify what is the >> >> > impact on performances (in the worst case??) >> >> >> >> If drivers are responsible for populating the hardware metadata before >> >> XDP, we could make sure drivers set the fields in order to avoid any >> >> memove() (and maybe even provide a helper to ensure this?). >> > >> > nope, since the current APIs introduced by Stanislav are consuming NIC >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, >> > timestamping, ..) into the packet (this is what Toke is proposing, right?). >> > In this case kfunc calling order makes a difference. >> > We can think even to add single kfunc to store all the info for all the NIC >> > metadata (maybe via a helping struct) but it seems not scalable to me and we >> > are losing kfunc versatility. >> >> Yes, I agree we should have separate kfuncs for each metadata field. >> Which means it makes a lot of sense to just use the same setter API that >> we use for the user-registered metadata fields, but using reserved keys. 
>> So something like: >> >> #define BPF_METADATA_HW_HASH BIT(60) >> #define BPF_METADATA_HW_TIMESTAMP BIT(61) >> #define BPF_METADATA_HW_VLAN BIT(62) >> #define BPF_METADATA_RESERVED (0xffff << 48) >> >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value); >> >> >> As for the internal representation, we can just have the kfunc do >> something like: >> >> int bpf_packet_metadata_set(field_id, value) { >> switch(field_id) { >> case BPF_METADATA_HW_HASH: >> pkt->xdp_hw_meta.hash = value; >> break; >> [...] >> default: >> /* do the key packing thing */ >> } >> } >> >> >> that way the order of setting the HW fields doesn't matter, only the >> user-defined metadata. > > Can you expand on why we need the flexibility of picking the metadata fields > here? Presumably we are talking about the use-cases where the XDP program > is doing redirect/pass and it doesn't really know who's the final > consumer is (might be another xdp program or might be the xdp->skb > kernel case), so the only sensible option here seems to be store everything? For the same reason that we have separate kfuncs for each metadata field when getting it from the driver: XDP programs should have the flexibility to decide which pieces of metadata they need, and skip the overhead of stuff that is not needed. For instance, say an XDP program knows that nothing in the system uses timestamps; in that case, it can skip both the getter and the setter call for timestamps. I suppose we *could* support just a single call to set the skb meta, like: bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data); ...but in that case, we'd need to support some fields being unset anyway, and the program would have to populate the struct on the stack before performing the call. So it seems simpler to just have symmetry between the get (from HW) and set side? :) -Toke
On 10/02, Toke Høiland-Jørgensen wrote: > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > On 10/01, Toke Høiland-Jørgensen wrote: > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> >> > > > >> >> > > >> > We could combine such a registration API with your header format, so > >> >> > > >> > that the registration just becomes a way of allocating one of the keys > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > >> >> > > >> > This would basically amount to moving the "service config file" into the > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon > >> >> > > >> > for BPF management have shown). > >> >> > > >> > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global > >> >> > > >> registry to enforce that the key has been registered beforehand? > >> >> > > >> > >> >> > > >> > > >> >> > > >> > -Toke > >> >> > > >> > >> >> > > >> Thanks for all the feedback! > >> >> > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > >> >> > > > each packet, right? > >> >> > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > >> >> > > a global registry for offsets would sidestep this, but have the > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think > >> >> > > the memmove() suggestion will probably lead to the least pain. > >> >> > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed > >> >> > > struct for it (but using the same set/get() API with reserved keys). The > >> >> > > only drawback of doing that is that we statically reserve a bit of > >> >> > > space, but I'm not sure that is such a big issue in practice (at least > >> >> > > not until this becomes to popular that the space starts to be contended; > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > >> >> > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the > >> >> > impact on performances (in the worst case??) > >> >> > >> >> If drivers are responsible for populating the hardware metadata before > >> >> XDP, we could make sure drivers set the fields in order to avoid any > >> >> memove() (and maybe even provide a helper to ensure this?). > >> > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?). > >> > In this case kfunc calling order makes a difference. > >> > We can think even to add single kfunc to store all the info for all the NIC > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we > >> > are losing kfunc versatility. > >> > >> Yes, I agree we should have separate kfuncs for each metadata field. 
> >> Which means it makes a lot of sense to just use the same setter API that > >> we use for the user-registered metadata fields, but using reserved keys. > >> So something like: > >> > >> #define BPF_METADATA_HW_HASH BIT(60) > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61) > >> #define BPF_METADATA_HW_VLAN BIT(62) > >> #define BPF_METADATA_RESERVED (0xffff << 48) > >> > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value); > >> > >> > >> As for the internal representation, we can just have the kfunc do > >> something like: > >> > >> int bpf_packet_metadata_set(field_id, value) { > >> switch(field_id) { > >> case BPF_METADATA_HW_HASH: > >> pkt->xdp_hw_meta.hash = value; > >> break; > >> [...] > >> default: > >> /* do the key packing thing */ > >> } > >> } > >> > >> > >> that way the order of setting the HW fields doesn't matter, only the > >> user-defined metadata. > > > > Can you expand on why we need the flexibility of picking the metadata fields > > here? Presumably we are talking about the use-cases where the XDP program > > is doing redirect/pass and it doesn't really know who's the final > > consumer is (might be another xdp program or might be the xdp->skb > > kernel case), so the only sensible option here seems to be store everything? > > For the same reason that we have separate kfuncs for each metadata field > when getting it from the driver: XDP programs should have the > flexibility to decide which pieces of metadata they need, and skip the > overhead of stuff that is not needed. > > For instance, say an XDP program knows that nothing in the system uses > timestamps; in that case, it can skip both the getter and the setter > call for timestamps. But doesn't it put us in the same place? Where the first (native) xdp program needs to know which metadata the final consumer wants. At this point why not propagate metadata layout as well? (or maybe I'm still missing what exact use-case we are trying to solve) > I suppose we *could* support just a single call to set the skb meta, > like: > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data); > > ...but in that case, we'd need to support some fields being unset > anyway, and the program would have to populate the struct on the stack > before performing the call. So it seems simpler to just have symmetry > between the get (from HW) and set side? :) Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store the metadata somewhere in xdp_md directly? (also presumably by reusing most of the existing kfuncs/xmo_xxx helpers)
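[A rough kernel-side sketch of the single-call variant Stanislav
suggests: one kfunc that drains whatever the driver exposes through its
existing xdp_metadata_ops and stashes it for the xdp->skb conversion.
The xmo_* hooks are the in-tree driver ops; xdp_hw_meta_area(), the
valid bitmask and the flag names are hypothetical:

__bpf_kfunc int bpf_set_skb_meta(struct xdp_md *ctx)
{
	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
	const struct xdp_metadata_ops *xmo =
		xdp->rxq->dev->xdp_metadata_ops;
	struct xdp_hw_meta *meta = xdp_hw_meta_area(xdp);

	if (!xmo)
		return -EOPNOTSUPP;
	if (xmo->xmo_rx_hash &&
	    !xmo->xmo_rx_hash(ctx, &meta->hash, &meta->hash_type))
		meta->valid |= XDP_META_HASH;
	if (xmo->xmo_rx_timestamp &&
	    !xmo->xmo_rx_timestamp(ctx, &meta->timestamp))
		meta->valid |= XDP_META_TIMESTAMP;
	return 0;
}

The trade-off relative to per-field setters is the one Toke raises just
above: unset fields still need a validity encoding, and every packet
pays for all the hints whether or not anything downstream reads them.]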
On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > On 10/02, Toke Høiland-Jørgensen wrote: > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > > >> > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > >> >> > > > > >> >> > > >> > We could combine such a registration API with your header format, so > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > > >> >> > > >> > This would basically amount to moving the "service config file" into the > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon > > >> >> > > >> > for BPF management have shown). > > >> >> > > >> > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global > > >> >> > > >> registry to enforce that the key has been registered beforehand? > > >> >> > > >> > > >> >> > > >> > > > >> >> > > >> > -Toke > > >> >> > > >> > > >> >> > > >> Thanks for all the feedback! > > >> >> > > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > >> >> > > > each packet, right? > > >> >> > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > >> >> > > a global registry for offsets would sidestep this, but have the > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think > > >> >> > > the memmove() suggestion will probably lead to the least pain. > > >> >> > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The > > >> >> > > only drawback of doing that is that we statically reserve a bit of > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least > > >> >> > > not until this becomes to popular that the space starts to be contended; > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > > >> >> > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the > > >> >> > impact on performances (in the worst case??) > > >> >> > > >> >> If drivers are responsible for populating the hardware metadata before > > >> >> XDP, we could make sure drivers set the fields in order to avoid any > > >> >> memove() (and maybe even provide a helper to ensure this?). > > >> > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?). > > >> > In this case kfunc calling order makes a difference. > > >> > We can think even to add single kfunc to store all the info for all the NIC > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we > > >> > are losing kfunc versatility. 
> > >>
> > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > >> Which means it makes a lot of sense to just use the same setter API that
> > >> we use for the user-registered metadata fields, but using reserved keys.
> > >> So something like:
> > >>
> > >> #define BPF_METADATA_HW_HASH BIT(60)
> > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > >> #define BPF_METADATA_HW_VLAN BIT(62)
> > >> #define BPF_METADATA_RESERVED (0xffff << 48)
> > >>
> > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > >>
> > >> As for the internal representation, we can just have the kfunc do
> > >> something like:
> > >>
> > >> int bpf_packet_metadata_set(field_id, value) {
> > >> 	switch(field_id) {
> > >> 	case BPF_METADATA_HW_HASH:
> > >> 		pkt->xdp_hw_meta.hash = value;
> > >> 		break;
> > >> 	[...]
> > >> 	default:
> > >> 		/* do the key packing thing */
> > >> 	}
> > >> }
> > >>
> > >> that way the order of setting the HW fields doesn't matter, only the
> > >> user-defined metadata.
> > >
> > > Can you expand on why we need the flexibility of picking the metadata fields
> > > here? Presumably we are talking about the use-cases where the XDP program
> > > is doing redirect/pass and it doesn't really know who's the final
> > > consumer is (might be another xdp program or might be the xdp->skb
> > > kernel case), so the only sensible option here seems to be store everything?
> >
> > For the same reason that we have separate kfuncs for each metadata field
> > when getting it from the driver: XDP programs should have the
> > flexibility to decide which pieces of metadata they need, and skip the
> > overhead of stuff that is not needed.
> >
> > For instance, say an XDP program knows that nothing in the system uses
> > timestamps; in that case, it can skip both the getter and the setter
> > call for timestamps.
>
> But doesn't it put us in the same place? Where the first (native) xdp program
> needs to know which metadata the final consumer wants. At this point
> why not propagate metadata layout as well?
>
> (or maybe I'm still missing what exact use-case we are trying to solve)

There are two different use-cases for the metadata:

* "Hardware" metadata (like the hash, rx_timestamp...). There are only a
few well-known fields, and only XDP can access them to set them as
metadata, so storing them in a struct somewhere could make sense.

* Arbitrary metadata used by services. E.g. a TC filter could set a field
describing which service a packet is for, and that could be reused for
iptables, routing, socket dispatch...
Similarly we could set a "packet_id" field that uniquely identifies a
packet so we can trace it throughout the network stack (through
clones, encap, decap, userspace services...).
The skb->mark, but with more room, and better support for sharing it.

We can only know the layout ahead of time for the first one. And they're
similar enough in their requirements (need to be stored somewhere in the
SKB, have a way of retrieving each one individually) that it seems to
make sense to use a common API.

> > I suppose we *could* support just a single call to set the skb meta,
> > like:
> >
> > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> >
> > ...but in that case, we'd need to support some fields being unset
> > anyway, and the program would have to populate the struct on the stack
> > before performing the call. So it seems simpler to just have symmetry
> > between the get (from HW) and set side? :)
> > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > the metadata somewhere in xdp_md directly? (also presumably by
> > reusing most of the existing kfuncs/xmo_xxx helpers)

If we store it in xdp_md, the metadata won't be available higher up the
stack (or am I missing something?). I think one of the goals is to let
things other than XDP access it (maybe even the network stack itself?).
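[A sketch of the second, "arbitrary metadata" use-case Arthur describes
above: a TC program tags every packet with a unique id that later hooks
(or a tool like pwru) could read back via the matching getter. The
setter kfunc is the *proposed* KV API and does not exist; the key value
stands in for one the deployment registered:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define TC_ACT_OK	0
#define PACKET_ID_KEY	42	/* hypothetical registered key */

/* Proposed KV setter from this thread -- hypothetical: */
extern int bpf_packet_metadata_set(struct __sk_buff *skb, __u64 key,
				   __u64 value) __ksym;

SEC("tc")
int tag_packet(struct __sk_buff *skb)
{
	static __u64 next_id;	/* a per-CPU counter would scale better */

	/* The id survives clones, encap and decap wherever the KV
	 * area itself survives, unlike the single skb->mark word. */
	bpf_packet_metadata_set(skb, PACKET_ID_KEY,
				__sync_fetch_and_add(&next_id, 1));
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
]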
On Tue Oct 1, 2024 at 5:28 PM CEST, Toke Høiland-Jørgensen wrote: > "Arthur Fabre" <afabre@cloudflare.com> writes: > > > On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote: > >> > Thinking about it more, my only relectance for a registration API is how > >> > to communicate the ID back to other consumers (our discussion below). > >> > > >> >> > >> >> > Dynamically registering fields means you have to share the returned ID > >> >> > with whoever is interested, which sounds tricky. > >> >> > If an XDP program sets a field like packet_id, every tracing > >> >> > program that looks at it, and userspace service, would need to know what > >> >> > the ID of that field is. > >> >> > Is there a way to easily share that ID with all of them? > >> >> > >> >> Right, so I'll admit this was one of the handwavy bits of my original > >> >> proposal, but I don't think it's unsolvable. You could do something like > >> >> (once, on application initialisation): > >> >> > >> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */ > >> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key); > >> >> > >> >> and then just get the key out of that map from all programs that want to > >> >> use it? > >> > > >> > Passing it out of band works (whether it's via a pinned map like you > >> > described, or through other means like a unix socket or some other > >> > API), it's just more complicated. > >> > > >> > Every consumer also needs to know about that API. That won't work with > >> > standard tools. For example if we set a PACKET_ID KV, maybe we could > >> > give it to pwru so it could track packets using it? > >> > Without registering keys, we could pass it as a cli flag. With > >> > registration, we'd have to have some helper to get the KV ID. > >> > > >> > It also introduces ordering dependencies between the services on > >> > startup, eg packet tracing hooks could only be attached once our XDP > >> > service has registered a PACKET_ID KV, and they could query it's API. > >> > >> Yeah, we definitely need a way to make that accessible and not too > >> cumbersome. > >> > >> I suppose what we really need is a way to map an application-specific > >> identifier to an ID. And, well, those identifiers could just be (string) > >> names? That's what we do for CO-RE, after all. So you'd get something > >> like: > >> > >> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */ > >> > >> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */ > >> > >> and we make that idempotent, so that two callers using the same name and > >> size will just get the same id back; and if called with BPF_NO_CREATE, > >> you'll get -ENOENT if it hasn't already been registered by someone else? > >> > >> We could even make this a sub-command of the bpf() syscall if we want it > >> to be UAPI, but that's not strictly necessary, applications can also > >> just call the registration from a syscall program at startup... > > > > That's a nice API, it makes sharing the IDs much easier. > > > > We still have to worry about collisions (what if two different things > > want to add their own "packet_id" field?). But at least: > > > > * "Any string" has many more possibilities than 0-64 keys. > > Yes, and it's easy to namespace (by prefixing, like > APPNAME_my_metaname). But sure, if everyone uses very generic names that > will be a problem. 
> > > * bpf_register_metadata() could return an error if a field is already > > registered, instead of silently letting an application overwrite > > metadata > > I don't think we can fundamentally solve the collision problem if we > also need to allow sharing keys (on purpose). I.e., we can't distinguish > between "these two applications deliberately want to share the packet_id > field" and "these two applications accidentally both picked packet_id as > their internal key". Good point. We either have to be happy with sharing the small keys space, or sharing the much bigger string space. > I suppose we could pre-define some extra reserved keys if there turns > out to be widely used identifiers that many applications want. But I'm > not sure if that should be there from the start, it quickly becomes very > speculative(packet_id comes to mind as one that might be generally > useful, but I'm really not sure, TBH). > > > (although arguably we could have add a BPF_NOEXIST style flag > > to the KV set() to kind of do the same). > > A global registry will need locking, so not sure it's a good idea to > support inserting into it in the fast path? (I meant just checking if a KV with that value has been set already or not, in the case where we don't have a registration API). That raises an interesting question: we probably won't be able to ensure that the keys passed to set() have been registered ahead of time. That would require checking the locked global registry as you point out. Misbehaving users could just skip calling register() altogether, and directly pick a random key to use. Maybe we could solve this by having a pair of atomic u64s per thread storing the KV header to describe which keys are allowed to be set, and what size they'll have? But that's starting to feel complicated. (Same for the size parameter in register() - we won't be able to enforce that that is actually the size then passed to set(). But I think we can just drop it - anyways we can't check the size ahead of time because we can't know about adjust_head() / expand_head() calls). > > > At least internally, it still feels like we'd maintain a registry of > > these string fields and make them configurable for each service to avoid > > collisions. > > Yeah, see above. Some kind of coordination (like a registry) is > impossible to avoid if you *want* to share data, but not sure how > common that is? > > -Toke
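[For the BPF_NOEXIST idea mentioned in passing above, the fast-path
shape might be as small as this; everything here (the _flags variant,
the flag, the key) is hypothetical:

/* Refuse to clobber a key that is already present, so two
 * applications accidentally sharing a key fail loudly instead of
 * silently overwriting each other -- with no global-registry
 * lookup in the fast path. */
static __always_inline int set_packet_id_once(struct xdp_md *ctx, __u64 id)
{
	int err;

	err = bpf_packet_metadata_set_flags(ctx, PACKET_ID_KEY, id,
					    BPF_NOEXIST);
	if (err == -EEXIST) {
		/* an earlier program in the chain already tagged it */
	}
	return err;
}
]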
On 10/03, Arthur Fabre wrote: > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > > On 10/02, Toke Høiland-Jørgensen wrote: > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > >> > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > >> >> > > > > > >> >> > > >> > We could combine such a registration API with your header format, so > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > > > >> >> > > >> > This would basically amount to moving the "service config file" into the > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon > > > >> >> > > >> > for BPF management have shown). > > > >> >> > > >> > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global > > > >> >> > > >> registry to enforce that the key has been registered beforehand? > > > >> >> > > >> > > > >> >> > > >> > > > > >> >> > > >> > -Toke > > > >> >> > > >> > > > >> >> > > >> Thanks for all the feedback! > > > >> >> > > > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > > >> >> > > > each packet, right? > > > >> >> > > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > > >> >> > > a global registry for offsets would sidestep this, but have the > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think > > > >> >> > > the memmove() suggestion will probably lead to the least pain. > > > >> >> > > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The > > > >> >> > > only drawback of doing that is that we statically reserve a bit of > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least > > > >> >> > > not until this becomes to popular that the space starts to be contended; > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > > > >> >> > > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the > > > >> >> > impact on performances (in the worst case??) > > > >> >> > > > >> >> If drivers are responsible for populating the hardware metadata before > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any > > > >> >> memove() (and maybe even provide a helper to ensure this?). > > > >> > > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?). > > > >> > In this case kfunc calling order makes a difference. 
> > > >> > We can think even to add single kfunc to store all the info for all the NIC > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we > > > >> > are losing kfunc versatility. > > > >> > > > >> Yes, I agree we should have separate kfuncs for each metadata field. > > > >> Which means it makes a lot of sense to just use the same setter API that > > > >> we use for the user-registered metadata fields, but using reserved keys. > > > >> So something like: > > > >> > > > >> #define BPF_METADATA_HW_HASH BIT(60) > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61) > > > >> #define BPF_METADATA_HW_VLAN BIT(62) > > > >> #define BPF_METADATA_RESERVED (0xffff << 48) > > > >> > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value); > > > >> > > > >> > > > >> As for the internal representation, we can just have the kfunc do > > > >> something like: > > > >> > > > >> int bpf_packet_metadata_set(field_id, value) { > > > >> switch(field_id) { > > > >> case BPF_METADATA_HW_HASH: > > > >> pkt->xdp_hw_meta.hash = value; > > > >> break; > > > >> [...] > > > >> default: > > > >> /* do the key packing thing */ > > > >> } > > > >> } > > > >> > > > >> > > > >> that way the order of setting the HW fields doesn't matter, only the > > > >> user-defined metadata. > > > > > > > > Can you expand on why we need the flexibility of picking the metadata fields > > > > here? Presumably we are talking about the use-cases where the XDP program > > > > is doing redirect/pass and it doesn't really know who's the final > > > > consumer is (might be another xdp program or might be the xdp->skb > > > > kernel case), so the only sensible option here seems to be store everything? > > > > > > For the same reason that we have separate kfuncs for each metadata field > > > when getting it from the driver: XDP programs should have the > > > flexibility to decide which pieces of metadata they need, and skip the > > > overhead of stuff that is not needed. > > > > > > For instance, say an XDP program knows that nothing in the system uses > > > timestamps; in that case, it can skip both the getter and the setter > > > call for timestamps. Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case, right? For this we pretty much know what kind of metadata we want to preserve, so why not ship it in the existing metadata area and have a kfunc that the xdp program will call prior to doing xdp_redirect? This kfunc can do exactly what you're suggesting - skip the timestamp if we know that the timestamping is off. Or have we moved to discussing some other use-cases? What am I missing about the need for some other new mechanism? > > But doesn't it put us in the same place? Where the first (native) xdp program > > needs to know which metadata the final consumer wants. At this point > > why not propagate metadata layout as well? > > > > (or maybe I'm still missing what exact use-case we are trying to solve) > > There are two different use-cases for the metadata: > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > few well known fields, and only XDP can access them to set them as > metadata, so storing them in a struct somewhere could make sense. > > * Arbitrary metadata used by services. Eg a TC filter could set a field > describing which service a packet is for, and that could be reused for > iptables, routing, socket dispatch... 
> Similarly we could set a "packet_id" field that uniquely identifies a
> packet so we can trace it throughout the network stack (through
> clones, encap, decap, userspace services...).
> The skb->mark, but with more room, and better support for sharing it.
>
> We can only know the layout ahead of time for the first one. And they're
> similar enough in their requirements (need to be stored somewhere in the
> SKB, have a way of retrieving each one individually, that it seems to
> make sense to use a common API).

Why not have the following layout then?

+---------------+-------------------+----------------------------------------+------+
| more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
+---------------+-------------------+----------------------------------------+------+
                ^                                                            ^
                data_meta                                                    data

You obviously still have a problem of communicating the layout if you
have some redirects in between, but you, in theory, still have this
problem with user-defined metadata anyway (unless I'm missing
something).

> > > I suppose we *could* support just a single call to set the skb meta,
> > > like:
> > >
> > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > >
> > > ...but in that case, we'd need to support some fields being unset
> > > anyway, and the program would have to populate the struct on the stack
> > > before performing the call. So it seems simpler to just have symmetry
> > > between the get (from HW) and set side? :)
> >
> > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > the metadata somewhere in xdp_md directly? (also presumably by
> > reusing most of the existing kfuncs/xmo_xxx helpers)
>
> If we store it in xdp_md, the metadata won't be available higher up the
> stack (or am I missing something?). I think one of the goals is to let
> things other than XDP access it (maybe even the network stack itself?).

IIRC, xdp metadata gets copied to skb metadata, so it does propagate.
Although, it might have a detrimental effect on GRO, but I'm assuming
that is something that can be fixed separately.
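[The copy Stanislav refers to happens in the xdp_frame -> skb
conversion; condensed from __xdp_build_skb_from_frame() in
net/core/xdp.c, the relevant steps are roughly:

	/* data_meta lives in the headroom directly in front of
	 * skb->data; skb_metadata_set() records its length so skb
	 * consumers (including GRO) know it is there. */
	skb_reserve(skb, headroom);
	__skb_put(skb, xdpf->len);
	if (metasize)
		skb_metadata_set(skb, metasize);
]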
Hi Stan, On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: > On 10/03, Arthur Fabre wrote: > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > > > On 10/02, Toke Høiland-Jørgensen wrote: > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > >> > > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > >> >> > > > > > > >> >> > > >> > We could combine such a registration API with your header format, so > > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys > > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > > > > >> >> > > >> > This would basically amount to moving the "service config file" into the > > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely > > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon > > > > >> >> > > >> > for BPF management have shown). > > > > >> >> > > >> > > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global > > > > >> >> > > >> registry to enforce that the key has been registered beforehand? > > > > >> >> > > >> > > > > >> >> > > >> > > > > > >> >> > > >> > -Toke > > > > >> >> > > >> > > > > >> >> > > >> Thanks for all the feedback! > > > > >> >> > > > > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > > > >> >> > > > each packet, right? > > > > >> >> > > > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > > > >> >> > > a global registry for offsets would sidestep this, but have the > > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think > > > > >> >> > > the memmove() suggestion will probably lead to the least pain. > > > > >> >> > > > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed > > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The > > > > >> >> > > only drawback of doing that is that we statically reserve a bit of > > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least > > > > >> >> > > not until this becomes to popular that the space starts to be contended; > > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > > > > >> >> > > > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the > > > > >> >> > impact on performances (in the worst case??) > > > > >> >> > > > > >> >> If drivers are responsible for populating the hardware metadata before > > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any > > > > >> >> memove() (and maybe even provide a helper to ensure this?). > > > > >> > > > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC > > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > > > > >> > timestamping, ..) 
into the packet (this is what Toke is proposing, right?). > > > > >> > In this case kfunc calling order makes a difference. > > > > >> > We can think even to add single kfunc to store all the info for all the NIC > > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we > > > > >> > are losing kfunc versatility. > > > > >> > > > > >> Yes, I agree we should have separate kfuncs for each metadata field. > > > > >> Which means it makes a lot of sense to just use the same setter API that > > > > >> we use for the user-registered metadata fields, but using reserved keys. > > > > >> So something like: > > > > >> > > > > >> #define BPF_METADATA_HW_HASH BIT(60) > > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61) > > > > >> #define BPF_METADATA_HW_VLAN BIT(62) > > > > >> #define BPF_METADATA_RESERVED (0xffff << 48) > > > > >> > > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value); > > > > >> > > > > >> > > > > >> As for the internal representation, we can just have the kfunc do > > > > >> something like: > > > > >> > > > > >> int bpf_packet_metadata_set(field_id, value) { > > > > >> switch(field_id) { > > > > >> case BPF_METADATA_HW_HASH: > > > > >> pkt->xdp_hw_meta.hash = value; > > > > >> break; > > > > >> [...] > > > > >> default: > > > > >> /* do the key packing thing */ > > > > >> } > > > > >> } > > > > >> > > > > >> > > > > >> that way the order of setting the HW fields doesn't matter, only the > > > > >> user-defined metadata. > > > > > > > > > > Can you expand on why we need the flexibility of picking the metadata fields > > > > > here? Presumably we are talking about the use-cases where the XDP program > > > > > is doing redirect/pass and it doesn't really know who's the final > > > > > consumer is (might be another xdp program or might be the xdp->skb > > > > > kernel case), so the only sensible option here seems to be store everything? > > > > > > > > For the same reason that we have separate kfuncs for each metadata field > > > > when getting it from the driver: XDP programs should have the > > > > flexibility to decide which pieces of metadata they need, and skip the > > > > overhead of stuff that is not needed. > > > > > > > > For instance, say an XDP program knows that nothing in the system uses > > > > timestamps; in that case, it can skip both the getter and the setter > > > > call for timestamps. > > Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case, > right? For this we pretty much know what kind of metadata we want to > preserve, so why not ship it in the existing metadata area and have > a kfunc that the xdp program will call prior to doing xdp_redirect? > This kfunc can do exactly what you're suggesting - skip the timestamp > if we know that the timestamping is off. > > Or have we moved to discussing some other use-cases? What am I missing > about the need for some other new mechanism? > > > > But doesn't it put us in the same place? Where the first (native) xdp program > > > needs to know which metadata the final consumer wants. At this point > > > why not propagate metadata layout as well? > > > > > > (or maybe I'm still missing what exact use-case we are trying to solve) > > > > There are two different use-cases for the metadata: > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > few well known fields, and only XDP can access them to set them as > > metadata, so storing them in a struct somewhere could make sense. > > > > * Arbitrary metadata used by services. 
Eg a TC filter could set a field > > describing which service a packet is for, and that could be reused for > > iptables, routing, socket dispatch... > > Similarly we could set a "packet_id" field that uniquely identifies a > > packet so we can trace it throughout the network stack (through > > clones, encap, decap, userspace services...). > > The skb->mark, but with more room, and better support for sharing it. > > > > We can only know the layout ahead of time for the first one. And they're > > similar enough in their requirements (need to be stored somewhere in the > > SKB, have a way of retrieving each one individually, that it seems to > > make sense to use a common API). > > Why not have the following layout then? > > +---------------+-------------------+----------------------------------------+------+ > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > +---------------+-------------------+----------------------------------------+------+ > ^ ^ > data_meta data > > You obviously still have a problem of communicating the layout if you > have some redirects in between, but you, in theory still have this > problem with user-defined metadata anyway (unless I'm missing > something). > > > > > I suppose we *could* support just a single call to set the skb meta, > > > > like: > > > > > > > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data); > > > > > > > > ...but in that case, we'd need to support some fields being unset > > > > anyway, and the program would have to populate the struct on the stack > > > > before performing the call. So it seems simpler to just have symmetry > > > > between the get (from HW) and set side? :) > > > > > > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store > > > the metadata somewhere in xdp_md directly? (also presumably by > > > reusing most of the existing kfuncs/xmo_xxx helpers) > > > > If we store it in xdp_md, the metadata won't be available higher up the > > stack (or am I missing something?). I think one of the goals is to let > > things other than XDP access it (maybe even the network stack itself?). > > IIRC, xdp metadata gets copied to skb metadata, so it does propagate. > Although, it might have a detrimental effect on the gro, but I'm > assuming that is something that can be fixed separately. I was thinking about this today so I'm glad you brought it up. IIUC putting unique data in the metadata area will prevent GRO from working. I wonder if as a part of this work there's a cleaner way to indicate to XDP or GRO engine that some metadata should be ignored for coalescing purposes. Otherwise the final XDP prog before GRO would need to memset() all the relevant bytes to 0 (assuming that even works). Thanks, Daniel
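A minimal sketch of the memset() workaround Daniel describes, assuming the "relevant bytes" are the whole data_meta area (the program name and the MAX_META bound are illustrative, not from this thread). GRO compares the metadata areas with skb_metadata_differs(), so clearing them before XDP_PASS keeps otherwise-identical packets mergeable:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_META 32	/* assumed upper bound on metadata size */

SEC("xdp")
int zero_meta_before_pass(struct xdp_md *ctx)
{
	__u8 *meta = (void *)(long)ctx->data_meta;
	__u8 *data = (void *)(long)ctx->data;
	int i;

	/* byte-wise, bounds-checked clear of the whole meta area */
	for (i = 0; i < MAX_META; i++) {
		if (meta + i + 1 > data)
			break;
		meta[i] = 0;
	}
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";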
On 04/10/2024 04.13, Daniel Xu wrote: > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: >> On 10/03, Arthur Fabre wrote: >>> On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: >>>> On 10/02, Toke Høiland-Jørgensen wrote: >>>>> Stanislav Fomichev <stfomichev@gmail.com> writes: >>>>> >>>>>> On 10/01, Toke Høiland-Jørgensen wrote: >>>>>>> Lorenzo Bianconi <lorenzo@kernel.org> writes: >>>>>>> >>>>>>>>> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: >>>>>>>>>>> Lorenzo Bianconi <lorenzo@kernel.org> writes: >>>>>>>>>>> [...] >>>>>>>>>>>> >>>>>>>>>>>> I like this 'fast' KV approach but I guess we should really evaluate its >>>>>>>>>>>> impact on performances (especially for xdp) since, based on the kfunc calls >>>>>>>>>>>> order in the ebpf program, we can have one or multiple memmove/memcpy for >>>>>>>>>>>> each packet, right? >>>>>>>>>>> >>>>>>>>>>> Yes, with Arthur's scheme, performance will be ordering dependent. Using I really like the *compact* Key-Value (KV) store idea from Arthur. - The question is: is it fast enough? I've promised Arthur to micro-benchmark this in XDP, if he codes this up to be usable in the XDP code path. Listening to the LPC recording I heard that Alexei also saw potential and other use-cases for this kind of fast-and-compact KV approach. I have high hopes for the performance, as Arthur uses the POPCNT instruction, which is *very* fast[1]. I checked[2]: AMD Zen 3 and 4 have ops/latency = 1 and a reciprocal throughput of 0.25. [1] https://www.agner.org/optimize/blog/read.php?i=853#848 [2] https://www.agner.org/optimize/instruction_tables.pdf [...] >>> There are two different use-cases for the metadata: >>> >>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a >>> few well known fields, and only XDP can access them to set them as >>> metadata, so storing them in a struct somewhere could make sense. >>> >>> * Arbitrary metadata used by services. Eg a TC filter could set a field >>> describing which service a packet is for, and that could be reused for >>> iptables, routing, socket dispatch... >>> Similarly we could set a "packet_id" field that uniquely identifies a >>> packet so we can trace it throughout the network stack (through >>> clones, encap, decap, userspace services...). >>> The skb->mark, but with more room, and better support for sharing it. >>> >>> We can only know the layout ahead of time for the first one. And they're >>> similar enough in their requirements (need to be stored somewhere in the >>> SKB, have a way of retrieving each one individually, that it seems to >>> make sense to use a common API). >> >> Why not have the following layout then? >> >> +---------------+-------------------+----------------------------------------+------+ >> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | >> +---------------+-------------------+----------------------------------------+------+ >> ^ ^ >> data_meta data >> >> You obviously still have a problem of communicating the layout if you >> have some redirects in between, but you, in theory still have this >> problem with user-defined metadata anyway (unless I'm missing >> something). >> Hmm, I think you are missing something... As far as I'm concerned we are discussing placing the KV data after the xdp_frame, and not in the XDP data_meta area (as your drawing suggests). The xdp_frame is stored at the very top of the headroom.
Lorenzo's patchset is extending struct xdp_frame and now we are discussing how we can make a more flexible API for extending this. I understand that Toke confirmed this here [3]. Let me know if I missed something :-) [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ As part of designing this flexible API, we/Toke are trying hard not to tie this to a specific data area. This is a good API design, keeping it flexible enough that we can move things around should the need arise. I don't think it is viable to store this KV data in the XDP data_meta area, because existing BPF progs already have direct memory (write) access and can change the size of the area, which creates too much headache with (existing) BPF progs creating unintentional breakage for the KV store, which would then need extensive checks to handle random corruptions (slowing down KV-store code). --Jesper
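For concreteness, a userspace-testable sketch of the popcount-indexed lookup being discussed, under the assumption of up to 64 keys with fixed 8-byte values (Arthur's actual encoding may differ, e.g. in how value sizes are handled). It shows the one-POPCNT-per-get() property being benchmarked, and kv_set() shows where the ordering-dependent memmove() cost comes from:

#include <stdint.h>
#include <string.h>

struct kv_store {
	uint64_t present;	/* bit k set => key k has a value */
	uint64_t vals[64];	/* values packed in ascending key order */
};

static int kv_get(const struct kv_store *kv, unsigned int key, uint64_t *out)
{
	if (key >= 64 || !(kv->present & (1ULL << key)))
		return -1;	/* key not set */
	/* index = number of present keys below this one: one POPCNT */
	*out = kv->vals[__builtin_popcountll(kv->present & ((1ULL << key) - 1))];
	return 0;
}

static int kv_set(struct kv_store *kv, unsigned int key, uint64_t val)
{
	unsigned int idx, n;

	if (key >= 64)
		return -1;
	idx = __builtin_popcountll(kv->present & ((1ULL << key) - 1));
	n = __builtin_popcountll(kv->present);

	if (!(kv->present & (1ULL << key))) {
		/* make room: this is the ordering-dependent memmove() */
		memmove(&kv->vals[idx + 1], &kv->vals[idx],
			(n - idx) * sizeof(kv->vals[0]));
		kv->present |= 1ULL << key;
	}
	kv->vals[idx] = val;
	return 0;
}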
On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote: > [...] > >>> There are two different use-cases for the metadata: > >>> > >>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > >>> few well known fields, and only XDP can access them to set them as > >>> metadata, so storing them in a struct somewhere could make sense. > >>> > >>> * Arbitrary metadata used by services. Eg a TC filter could set a field > >>> describing which service a packet is for, and that could be reused for > >>> iptables, routing, socket dispatch... > >>> Similarly we could set a "packet_id" field that uniquely identifies a > >>> packet so we can trace it throughout the network stack (through > >>> clones, encap, decap, userspace services...). > >>> The skb->mark, but with more room, and better support for sharing it. > >>> > >>> We can only know the layout ahead of time for the first one. And they're > >>> similar enough in their requirements (need to be stored somewhere in the > >>> SKB, have a way of retrieving each one individually, that it seems to > >>> make sense to use a common API). > >> > >> Why not have the following layout then? > >> > >> +---------------+-------------------+----------------------------------------+------+ > >> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > >> +---------------+-------------------+----------------------------------------+------+ > >> ^ ^ > >> data_meta data > >> > >> You obviously still have a problem of communicating the layout if you > >> have some redirects in between, but you, in theory still have this > >> problem with user-defined metadata anyway (unless I'm missing > >> something). > >> > > Hmm, I think you are missing something... As far as I'm concerned we are > discussing placing the KV data after the xdp_frame, and not in the XDP > data_meta area (as your drawing suggests). The xdp_frame is stored at > the very top of the headroom. Lorenzo's patchset is extending struct > xdp_frame and now we are discussing to we can make a more flexible API > for extending this. I understand that Toke confirmed this here [3]. Let > me know if I missed something :-) > > [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > > As part of designing this flexible API, we/Toke are trying hard not to > tie this to a specific data area. This is a good API design, keeping it > flexible enough that we can move things around should the need arise. +1. And if we have an API for doing this for user-defined metadata, it seems like we might as well use it for hardware metadata too. With something roughly like: *val get(id) set(id, *val) with pre-defined ids for hardware metadata, consumers don't need to know the layout, or where / how the data is stored. Under the hood we can implement it however we want, and change it in the future. I was initially thinking we could store hardware metadata the same way as user defined metadata, but Toke and Lorenzo seem to prefer storing it in a fixed struct.
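Spelled out, the API shape Arthur describes might look roughly like the following (every name and the exact ID split are invented here for illustration): reserved IDs map to the well-known hardware fields, while the rest of the ID space is handed out by a registration API for user-defined fields:

#include <linux/bpf.h>	/* struct xdp_md, __u32 */

enum pkt_meta_id {
	/* 0..60: allocated dynamically for user-defined fields */
	PKT_META_HW_HASH	= 61,	/* reserved, well-known */
	PKT_META_HW_TIMESTAMP	= 62,
	PKT_META_HW_VLAN	= 63,
};

/* consumers see only ids; storage location and layout stay hidden */
int bpf_pkt_meta_set(struct xdp_md *pkt, enum pkt_meta_id id,
		     const void *val, __u32 len);
int bpf_pkt_meta_get(struct xdp_md *pkt, enum pkt_meta_id id,
		     void *val, __u32 len);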
On 04/10/2024 15.55, Arthur Fabre wrote: > On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote: >> [...] >>>>> There are two different use-cases for the metadata: >>>>> >>>>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a >>>>> few well known fields, and only XDP can access them to set them as >>>>> metadata, so storing them in a struct somewhere could make sense. >>>>> >>>>> * Arbitrary metadata used by services. Eg a TC filter could set a field >>>>> describing which service a packet is for, and that could be reused for >>>>> iptables, routing, socket dispatch... >>>>> Similarly we could set a "packet_id" field that uniquely identifies a >>>>> packet so we can trace it throughout the network stack (through >>>>> clones, encap, decap, userspace services...). >>>>> The skb->mark, but with more room, and better support for sharing it. >>>>> >>>>> We can only know the layout ahead of time for the first one. And they're >>>>> similar enough in their requirements (need to be stored somewhere in the >>>>> SKB, have a way of retrieving each one individually, that it seems to >>>>> make sense to use a common API). >>>> >>>> Why not have the following layout then? >>>> >>>> +---------------+-------------------+----------------------------------------+------+ >>>> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | >>>> +---------------+-------------------+----------------------------------------+------+ >>>> ^ ^ >>>> data_meta data >>>> >>>> You obviously still have a problem of communicating the layout if you >>>> have some redirects in between, but you, in theory still have this >>>> problem with user-defined metadata anyway (unless I'm missing >>>> something). >>>> >> >> Hmm, I think you are missing something... As far as I'm concerned we are >> discussing placing the KV data after the xdp_frame, and not in the XDP >> data_meta area (as your drawing suggests). The xdp_frame is stored at >> the very top of the headroom. Lorenzo's patchset is extending struct >> xdp_frame and now we are discussing to we can make a more flexible API >> for extending this. I understand that Toke confirmed this here [3]. Let >> me know if I missed something :-) >> >> [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ >> >> As part of designing this flexible API, we/Toke are trying hard not to >> tie this to a specific data area. This is a good API design, keeping it >> flexible enough that we can move things around should the need arise. > > +1. And if we have an API for doing this for user-defined metadata, it > seems like we might as well use it for hardware metadata too. > > With something roughly like: > > *val get(id) > > set(id, *val) > > with pre-defined ids for hardware metadata, consumers don't need to know > the layout, or where / how the data is stored. > > Under the hood we can implement it however we want, and change it in the > future. > > I was initially thinking we could store hardware metadata the same way > as user defined metadata, but Toke and Lorenzo seem to prefer storing it > in a fixed struct. If the API hides the actual location then we can always move things around later. If your popcnt approach is fast enough, then IMO we don't need a fixed struct for hardware metadata. --Jesper
On Oct 04, Jesper Dangaard Brouer wrote: > > > On 04/10/2024 15.55, Arthur Fabre wrote: > > On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote: > > > [...] > > > > > > There are two different use-cases for the metadata: > > > > > > > > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > > > > > few well known fields, and only XDP can access them to set them as > > > > > > metadata, so storing them in a struct somewhere could make sense. > > > > > > > > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > > > > > > describing which service a packet is for, and that could be reused for > > > > > > iptables, routing, socket dispatch... > > > > > > Similarly we could set a "packet_id" field that uniquely identifies a > > > > > > packet so we can trace it throughout the network stack (through > > > > > > clones, encap, decap, userspace services...). > > > > > > The skb->mark, but with more room, and better support for sharing it. > > > > > > > > > > > > We can only know the layout ahead of time for the first one. And they're > > > > > > similar enough in their requirements (need to be stored somewhere in the > > > > > > SKB, have a way of retrieving each one individually, that it seems to > > > > > > make sense to use a common API). > > > > > > > > > > Why not have the following layout then? > > > > > > > > > > +---------------+-------------------+----------------------------------------+------+ > > > > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > > > > > +---------------+-------------------+----------------------------------------+------+ > > > > > ^ ^ > > > > > data_meta data > > > > > > > > > > You obviously still have a problem of communicating the layout if you > > > > > have some redirects in between, but you, in theory still have this > > > > > problem with user-defined metadata anyway (unless I'm missing > > > > > something). > > > > > > > > > > > Hmm, I think you are missing something... As far as I'm concerned we are > > > discussing placing the KV data after the xdp_frame, and not in the XDP > > > data_meta area (as your drawing suggests). The xdp_frame is stored at > > > the very top of the headroom. Lorenzo's patchset is extending struct > > > xdp_frame and now we are discussing to we can make a more flexible API > > > for extending this. I understand that Toke confirmed this here [3]. Let > > > me know if I missed something :-) > > > > > > [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > > > > > > As part of designing this flexible API, we/Toke are trying hard not to > > > tie this to a specific data area. This is a good API design, keeping it > > > flexible enough that we can move things around should the need arise. > > > > +1. And if we have an API for doing this for user-defined metadata, it > > seems like we might as well use it for hardware metadata too. > > > > With something roughly like: > > > > *val get(id) > > > > set(id, *val) > > > > with pre-defined ids for hardware metadata, consumers don't need to know > > the layout, or where / how the data is stored. > > > > Under the hood we can implement it however we want, and change it in the > > future. > > > > I was initially thinking we could store hardware metadata the same way > > as user defined metadata, but Toke and Lorenzo seem to prefer storing it > > in a fixed struct. > > If the API hide the actual location then we can always move things > around, later. 
If your popcnt approach is fast enough, then IMO we > don't need a fixed struct for hardware metadata. +1. I am fine with the KV approach for NIC metadata as well if it is fast enough. If you want I can modify my series to use kfuncs to store data after xdp_frame and then you can plug in the KV encoding. What do you think? Up to you. Regards, Lorenzo > > --Jesper
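As a sketch of what that seam could look like (all names here are invented; struct hw_hints stands in for whatever encoding is behind the kfunc, and the xdp_md-to-xdp_buff cast mirrors what existing metadata kfuncs do): the kfunc owns the "store this hint" operation and writes into the area right after the xdp_frame, so swapping the fixed struct for the KV encoding later would not change the kfunc's signature or the BPF programs calling it:

#include <linux/bpf.h>
#include <linux/btf.h>
#include <net/xdp.h>	/* struct xdp_buff, struct xdp_frame */

struct hw_hints {	/* illustrative stand-in for the encoding */
	u32 rx_hash;
};

__bpf_kfunc int bpf_xdp_store_rx_hash(struct xdp_md *ctx, u32 hash)
{
	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
	void *store = xdp->data_hard_start + sizeof(struct xdp_frame);

	/* the only line that knows the encoding; replace with e.g.
	 * kv_set(store, PKT_META_HW_HASH, hash) to plug in the KV store */
	((struct hw_hints *)store)->rx_hash = hash;
	return 0;
}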
On Fri Oct 4, 2024 at 4:18 PM CEST, Lorenzo Bianconi wrote: > On Oct 04, Jesper Dangaard Brouer wrote: > > On 04/10/2024 15.55, Arthur Fabre wrote: > > > On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote: > > > > [...] > > > > > > > There are two different use-cases for the metadata: > > > > > > > > > > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > > > > > > few well known fields, and only XDP can access them to set them as > > > > > > > metadata, so storing them in a struct somewhere could make sense. > > > > > > > > > > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > > > > > > > describing which service a packet is for, and that could be reused for > > > > > > > iptables, routing, socket dispatch... > > > > > > > Similarly we could set a "packet_id" field that uniquely identifies a > > > > > > > packet so we can trace it throughout the network stack (through > > > > > > > clones, encap, decap, userspace services...). > > > > > > > The skb->mark, but with more room, and better support for sharing it. > > > > > > > > > > > > > > We can only know the layout ahead of time for the first one. And they're > > > > > > > similar enough in their requirements (need to be stored somewhere in the > > > > > > > SKB, have a way of retrieving each one individually, that it seems to > > > > > > > make sense to use a common API). > > > > > > > > > > > > Why not have the following layout then? > > > > > > > > > > > > +---------------+-------------------+----------------------------------------+------+ > > > > > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > > > > > > +---------------+-------------------+----------------------------------------+------+ > > > > > > ^ ^ > > > > > > data_meta data > > > > > > > > > > > > You obviously still have a problem of communicating the layout if you > > > > > > have some redirects in between, but you, in theory still have this > > > > > > problem with user-defined metadata anyway (unless I'm missing > > > > > > something). > > > > > > > > > > > > > > Hmm, I think you are missing something... As far as I'm concerned we are > > > > discussing placing the KV data after the xdp_frame, and not in the XDP > > > > data_meta area (as your drawing suggests). The xdp_frame is stored at > > > > the very top of the headroom. Lorenzo's patchset is extending struct > > > > xdp_frame and now we are discussing to we can make a more flexible API > > > > for extending this. I understand that Toke confirmed this here [3]. Let > > > > me know if I missed something :-) > > > > > > > > [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > > > > > > > > As part of designing this flexible API, we/Toke are trying hard not to > > > > tie this to a specific data area. This is a good API design, keeping it > > > > flexible enough that we can move things around should the need arise. > > > > > > +1. And if we have an API for doing this for user-defined metadata, it > > > seems like we might as well use it for hardware metadata too. > > > > > > With something roughly like: > > > > > > *val get(id) > > > > > > set(id, *val) > > > > > > with pre-defined ids for hardware metadata, consumers don't need to know > > > the layout, or where / how the data is stored. > > > > > > Under the hood we can implement it however we want, and change it in the > > > future. 
> > > > > > I was initially thinking we could store hardware metadata the same way > > > as user defined metadata, but Toke and Lorenzo seem to prefer storing it > > > in a fixed struct. > > > > If the API hide the actual location then we can always move things > > around, later. If your popcnt approach is fast enough, then IMO we > > don't need a fixed struct for hardware metadata. > > +1. I am fine with the KV approach for nic metadata as well if it is fast enough. Great! That's simpler. I should have something for Jesper to benchmark on Monday. > If you want I can modify my series to use kfunc sto store data after xdp_frame > and then you can plug the KV encoding. What do you think? Up to you. Thanks for the offer! That works for me :)
On 10/03, Daniel Xu wrote: > Hi Stan, > > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: > > On 10/03, Arthur Fabre wrote: > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > > > > On 10/02, Toke Høiland-Jørgensen wrote: > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > > > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > > >> > > > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > > > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > > >> >> > > > > > > > >> >> > > >> > We could combine such a registration API with your header format, so > > > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys > > > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header). > > > > > >> >> > > >> > This would basically amount to moving the "service config file" into the > > > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely > > > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon > > > > > >> >> > > >> > for BPF management have shown). > > > > > >> >> > > >> > > > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global > > > > > >> >> > > >> registry to enforce that the key has been registered beforehand? > > > > > >> >> > > >> > > > > > >> >> > > >> > > > > > > >> >> > > >> > -Toke > > > > > >> >> > > >> > > > > > >> >> > > >> Thanks for all the feedback! > > > > > >> >> > > > > > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its > > > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls > > > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > > > > >> >> > > > each packet, right? > > > > > >> >> > > > > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > > > > >> >> > > a global registry for offsets would sidestep this, but have the > > > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think > > > > > >> >> > > the memmove() suggestion will probably lead to the least pain. > > > > > >> >> > > > > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed > > > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The > > > > > >> >> > > only drawback of doing that is that we statically reserve a bit of > > > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least > > > > > >> >> > > not until this becomes to popular that the space starts to be contended; > > > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)). > > > > > >> >> > > > > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the > > > > > >> >> > impact on performances (in the worst case??) > > > > > >> >> > > > > > >> >> If drivers are responsible for populating the hardware metadata before > > > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any > > > > > >> >> memove() (and maybe even provide a helper to ensure this?). 
> > > > > >> > > > > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC > > > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > > > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > > > > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?). > > > > > >> > In this case kfunc calling order makes a difference. > > > > > >> > We can think even to add single kfunc to store all the info for all the NIC > > > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we > > > > > >> > are losing kfunc versatility. > > > > > >> > > > > > >> Yes, I agree we should have separate kfuncs for each metadata field. > > > > > >> Which means it makes a lot of sense to just use the same setter API that > > > > > >> we use for the user-registered metadata fields, but using reserved keys. > > > > > >> So something like: > > > > > >> > > > > > >> #define BPF_METADATA_HW_HASH BIT(60) > > > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61) > > > > > >> #define BPF_METADATA_HW_VLAN BIT(62) > > > > > >> #define BPF_METADATA_RESERVED (0xffff << 48) > > > > > >> > > > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value); > > > > > >> > > > > > >> > > > > > >> As for the internal representation, we can just have the kfunc do > > > > > >> something like: > > > > > >> > > > > > >> int bpf_packet_metadata_set(field_id, value) { > > > > > >> switch(field_id) { > > > > > >> case BPF_METADATA_HW_HASH: > > > > > >> pkt->xdp_hw_meta.hash = value; > > > > > >> break; > > > > > >> [...] > > > > > >> default: > > > > > >> /* do the key packing thing */ > > > > > >> } > > > > > >> } > > > > > >> > > > > > >> > > > > > >> that way the order of setting the HW fields doesn't matter, only the > > > > > >> user-defined metadata. > > > > > > > > > > > > Can you expand on why we need the flexibility of picking the metadata fields > > > > > > here? Presumably we are talking about the use-cases where the XDP program > > > > > > is doing redirect/pass and it doesn't really know who's the final > > > > > > consumer is (might be another xdp program or might be the xdp->skb > > > > > > kernel case), so the only sensible option here seems to be store everything? > > > > > > > > > > For the same reason that we have separate kfuncs for each metadata field > > > > > when getting it from the driver: XDP programs should have the > > > > > flexibility to decide which pieces of metadata they need, and skip the > > > > > overhead of stuff that is not needed. > > > > > > > > > > For instance, say an XDP program knows that nothing in the system uses > > > > > timestamps; in that case, it can skip both the getter and the setter > > > > > call for timestamps. > > > > Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case, > > right? For this we pretty much know what kind of metadata we want to > > preserve, so why not ship it in the existing metadata area and have > > a kfunc that the xdp program will call prior to doing xdp_redirect? > > This kfunc can do exactly what you're suggesting - skip the timestamp > > if we know that the timestamping is off. > > > > Or have we moved to discussing some other use-cases? What am I missing > > about the need for some other new mechanism? > > > > > > But doesn't it put us in the same place? Where the first (native) xdp program > > > > needs to know which metadata the final consumer wants. 
At this point > > > > why not propagate metadata layout as well? > > > > > > > > (or maybe I'm still missing what exact use-case we are trying to solve) > > > > > > There are two different use-cases for the metadata: > > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > > few well known fields, and only XDP can access them to set them as > > > metadata, so storing them in a struct somewhere could make sense. > > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > > > describing which service a packet is for, and that could be reused for > > > iptables, routing, socket dispatch... > > > Similarly we could set a "packet_id" field that uniquely identifies a > > > packet so we can trace it throughout the network stack (through > > > clones, encap, decap, userspace services...). > > > The skb->mark, but with more room, and better support for sharing it. > > > > > > We can only know the layout ahead of time for the first one. And they're > > > similar enough in their requirements (need to be stored somewhere in the > > > SKB, have a way of retrieving each one individually, that it seems to > > > make sense to use a common API). > > > > Why not have the following layout then? > > > > +---------------+-------------------+----------------------------------------+------+ > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > > +---------------+-------------------+----------------------------------------+------+ > > ^ ^ > > data_meta data > > > > You obviously still have a problem of communicating the layout if you > > have some redirects in between, but you, in theory still have this > > problem with user-defined metadata anyway (unless I'm missing > > something). > > > > > > > I suppose we *could* support just a single call to set the skb meta, > > > > > like: > > > > > > > > > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data); > > > > > > > > > > ...but in that case, we'd need to support some fields being unset > > > > > anyway, and the program would have to populate the struct on the stack > > > > > before performing the call. So it seems simpler to just have symmetry > > > > > between the get (from HW) and set side? :) > > > > > > > > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store > > > > the metadata somewhere in xdp_md directly? (also presumably by > > > > reusing most of the existing kfuncs/xmo_xxx helpers) > > > > > > If we store it in xdp_md, the metadata won't be available higher up the > > > stack (or am I missing something?). I think one of the goals is to let > > > things other than XDP access it (maybe even the network stack itself?). > > > > IIRC, xdp metadata gets copied to skb metadata, so it does propagate. > > Although, it might have a detrimental effect on the gro, but I'm > > assuming that is something that can be fixed separately. > > I was thinking about this today so I'm glad you brought it up. > > IIUC putting unique data in the metadata area will prevent GRO from > working. I wonder if as a part of this work there's a cleaner way to > indicate to XDP or GRO engine that some metadata should be ignored for > coalescing purposes. Otherwise the final XDP prog before GRO would need > to memset() all the relevant bytes to 0 (assuming that even works). I'm assuming it is that way because there has to be a conscious decision (TBD somewhere) about how to merge the metadata. 
IOW, there needs to be some definition of 'ignored for coalescing purposes'. Do we ignore the old metadata (the one that's already in the GRO queue) or the new metadata (in the skb)? What if there is a timestamp in the metadata? In this case, depending on what you ignore, you get a different timestamp.
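For reference, the reason this is an issue at all: GRO already refuses to merge two skbs whose metadata areas differ. Roughly (paraphrasing the check done in gro_list_prepare() in net/core/gro.c; the helper name here is mine):

#include <linux/skbuff.h>

/* two packets are only coalescing candidates if their metadata areas
 * compare equal, so any per-packet unique value (packet_id, timestamp,
 * ...) disables GRO for that flow unless it is cleared first */
static bool meta_allows_merge(const struct sk_buff *p,
			      const struct sk_buff *skb)
{
	return !skb_metadata_differs(p, skb);
}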
On 10/04, Jesper Dangaard Brouer wrote: > > > On 04/10/2024 04.13, Daniel Xu wrote: > > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: > > > On 10/03, Arthur Fabre wrote: > > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > > > > > On 10/02, Toke Høiland-Jørgensen wrote: > > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > > > > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > > > > > > > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its > > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls > > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > > > > > > > > > > > > each packet, right? > > > > > > > > > > > > > > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > I really like the *compact* Key-Value (KV) store idea from Arthur. > - The question is it is fast enough? > > I've promised Arthur to XDP micro-benchmark this, if he codes this up to > be usable in the XDP code path. Listening to the LPC recording I heard > that Alexei also saw potential and other use-case for this kind of > fast-and-compact KV approach. > > I have high hopes for the performance, as Arthur uses POPCNT instruction > which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1 > and Reciprocal throughput 0.25. > > [1] https://www.agner.org/optimize/blog/read.php?i=853#848 > [2] https://www.agner.org/optimize/instruction_tables.pdf > > [...] > > > > There are two different use-cases for the metadata: > > > > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > > > few well known fields, and only XDP can access them to set them as > > > > metadata, so storing them in a struct somewhere could make sense. > > > > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > > > > describing which service a packet is for, and that could be reused for > > > > iptables, routing, socket dispatch... > > > > Similarly we could set a "packet_id" field that uniquely identifies a > > > > packet so we can trace it throughout the network stack (through > > > > clones, encap, decap, userspace services...). > > > > The skb->mark, but with more room, and better support for sharing it. > > > > > > > > We can only know the layout ahead of time for the first one. And they're > > > > similar enough in their requirements (need to be stored somewhere in the > > > > SKB, have a way of retrieving each one individually, that it seems to > > > > make sense to use a common API). > > > > > > Why not have the following layout then? 
> > > > > > +---------------+-------------------+----------------------------------------+------+ > > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > > > +---------------+-------------------+----------------------------------------+------+ > > > ^ ^ > > > data_meta data > > > > > > You obviously still have a problem of communicating the layout if you > > > have some redirects in between, but you, in theory still have this > > > problem with user-defined metadata anyway (unless I'm missing > > > something). > > > > > Hmm, I think you are missing something... As far as I'm concerned we are > discussing placing the KV data after the xdp_frame, and not in the XDP > data_meta area (as your drawing suggests). The xdp_frame is stored at > the very top of the headroom. Lorenzo's patchset is extending struct > xdp_frame and now we are discussing to we can make a more flexible API > for extending this. I understand that Toke confirmed this here [3]. Let > me know if I missed something :-) > > [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > > As part of designing this flexible API, we/Toke are trying hard not to > tie this to a specific data area. This is a good API design, keeping it > flexible enough that we can move things around should the need arise. > > I don't think it is viable to store this KV data in XDP data_meta area, > because existing BPF-prog's already have direct memory (write) access > and can change size of area, which creates too much headache with > (existing) BPF-progs creating unintentional breakage for the KV store, > which would then need extensive checks to handle random corruptions > (slowing down KV-store code). Yes, I'm definitely missing the bigger picture. If we want to have a global metadata registry in the kernel, why can't it be built on top of the existing area? Have some control API to define the layout and some new API to attach the layout id to the xdp_frame. And one of those layouts might be the xdp->skb one (although, for performance's sake, it still makes more sense to special-case the xdp->skb one rather than asking the kernel to parse the KV layout definition). But I'm happy to wait for the v2 and re-read the cover letter :-)
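A sketch of that alternative for concreteness (all names invented here): layouts are registered once through a control API, and each frame then carries only the 32-bit id of the layout its metadata follows, so consumers do a table lookup instead of parsing a per-packet format:

#include <linux/types.h>

struct meta_field {
	__u16 offset;	/* from the start of the metadata area */
	__u16 size;
};

struct meta_layout {
	__u32 id;			/* attached to the xdp_frame */
	__u32 nfields;
	struct meta_field field[16];	/* arbitrary cap for the sketch */
};

/* consumer side: locating a field is two array indexes, no parsing */
static inline void *meta_field_ptr(void *meta_area,
				   const struct meta_layout *l,
				   unsigned int f)
{
	return meta_area + l->field[f].offset;
}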
Stanislav Fomichev <stfomichev@gmail.com> writes: > On 10/04, Jesper Dangaard Brouer wrote: >> >> >> On 04/10/2024 04.13, Daniel Xu wrote: >> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: >> > > On 10/03, Arthur Fabre wrote: >> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: >> > > > > On 10/02, Toke Høiland-Jørgensen wrote: >> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: >> > > > > > >> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: >> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: >> > > > > > > > >> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: >> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: >> > > > > > > > > > > > >> [...] >> > > > > > > > > > > > > >> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its >> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls >> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for >> > > > > > > > > > > > > each packet, right? >> > > > > > > > > > > > >> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using >> >> I really like the *compact* Key-Value (KV) store idea from Arthur. >> - The question is it is fast enough? >> >> I've promised Arthur to XDP micro-benchmark this, if he codes this up to >> be usable in the XDP code path. Listening to the LPC recording I heard >> that Alexei also saw potential and other use-case for this kind of >> fast-and-compact KV approach. >> >> I have high hopes for the performance, as Arthur uses POPCNT instruction >> which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1 >> and Reciprocal throughput 0.25. >> >> [1] https://www.agner.org/optimize/blog/read.php?i=853#848 >> [2] https://www.agner.org/optimize/instruction_tables.pdf >> >> [...] >> > > > There are two different use-cases for the metadata: >> > > > >> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a >> > > > few well known fields, and only XDP can access them to set them as >> > > > metadata, so storing them in a struct somewhere could make sense. >> > > > >> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field >> > > > describing which service a packet is for, and that could be reused for >> > > > iptables, routing, socket dispatch... >> > > > Similarly we could set a "packet_id" field that uniquely identifies a >> > > > packet so we can trace it throughout the network stack (through >> > > > clones, encap, decap, userspace services...). >> > > > The skb->mark, but with more room, and better support for sharing it. >> > > > >> > > > We can only know the layout ahead of time for the first one. And they're >> > > > similar enough in their requirements (need to be stored somewhere in the >> > > > SKB, have a way of retrieving each one individually, that it seems to >> > > > make sense to use a common API). >> > > >> > > Why not have the following layout then? 
>> > > >> > > +---------------+-------------------+----------------------------------------+------+ >> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | >> > > +---------------+-------------------+----------------------------------------+------+ >> > > ^ ^ >> > > data_meta data >> > > >> > > You obviously still have a problem of communicating the layout if you >> > > have some redirects in between, but you, in theory still have this >> > > problem with user-defined metadata anyway (unless I'm missing >> > > something). >> > > >> >> Hmm, I think you are missing something... As far as I'm concerned we are >> discussing placing the KV data after the xdp_frame, and not in the XDP >> data_meta area (as your drawing suggests). The xdp_frame is stored at >> the very top of the headroom. Lorenzo's patchset is extending struct >> xdp_frame and now we are discussing to we can make a more flexible API >> for extending this. I understand that Toke confirmed this here [3]. Let >> me know if I missed something :-) >> >> [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ >> >> As part of designing this flexible API, we/Toke are trying hard not to >> tie this to a specific data area. This is a good API design, keeping it >> flexible enough that we can move things around should the need arise. >> >> I don't think it is viable to store this KV data in XDP data_meta area, >> because existing BPF-prog's already have direct memory (write) access >> and can change size of area, which creates too much headache with >> (existing) BPF-progs creating unintentional breakage for the KV store, >> which would then need extensive checks to handle random corruptions >> (slowing down KV-store code). > > Yes, I'm definitely missing the bigger picture. If we want to have a global > metadata registry in the kernel, why can't it be built on top of the existing > area? Because we have no way of preventing existing XDP programs from overwriting (corrupting) the area using the xdp_adjust_meta() API and data_meta field. But in a sense the *memory area* is shared between the two APIs, in the sense that they both use the headroom before the packet data, just from opposite ends. So if you store lots of data using the new KV API, that space will no longer be available for xdp_adjust_{head,meta}. But the kernel can enforce this so we don't get programs corrupting the KV format. -Toke
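The enforcement can be cheap. A sketch under the assumptions discussed in this thread (the KV area starting right after the xdp_frame; the helper name is invented, the xdp_buff field names are the real ones): the KV store grows towards the packet, xdp_adjust_{head,meta}() grow the other way, and the kernel only has to keep the two ends from crossing:

#include <net/xdp.h>

static bool kv_may_grow(const struct xdp_buff *xdp,
			size_t kv_used, size_t extra)
{
	void *kv_start = xdp->data_hard_start + sizeof(struct xdp_frame);

	/* whatever the KV store doesn't take remains available
	 * to xdp_adjust_{head,meta}() */
	return kv_start + kv_used + extra <= xdp->data_meta;
}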
On 10/06, Toke Høiland-Jørgensen wrote: > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > On 10/04, Jesper Dangaard Brouer wrote: > >> > >> > >> On 04/10/2024 04.13, Daniel Xu wrote: > >> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: > >> > > On 10/03, Arthur Fabre wrote: > >> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > >> > > > > On 10/02, Toke Høiland-Jørgensen wrote: > >> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > >> > > > > > > >> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > >> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> > > > > > > > > >> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > >> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> > > > > > > > > > > > > >> [...] > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its > >> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls > >> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > >> > > > > > > > > > > > > each packet, right? > >> > > > > > > > > > > > > >> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > >> > >> I really like the *compact* Key-Value (KV) store idea from Arthur. > >> - The question is it is fast enough? > >> > >> I've promised Arthur to XDP micro-benchmark this, if he codes this up to > >> be usable in the XDP code path. Listening to the LPC recording I heard > >> that Alexei also saw potential and other use-case for this kind of > >> fast-and-compact KV approach. > >> > >> I have high hopes for the performance, as Arthur uses POPCNT instruction > >> which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1 > >> and Reciprocal throughput 0.25. > >> > >> [1] https://www.agner.org/optimize/blog/read.php?i=853#848 > >> [2] https://www.agner.org/optimize/instruction_tables.pdf > >> > >> [...] > >> > > > There are two different use-cases for the metadata: > >> > > > > >> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > >> > > > few well known fields, and only XDP can access them to set them as > >> > > > metadata, so storing them in a struct somewhere could make sense. > >> > > > > >> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > >> > > > describing which service a packet is for, and that could be reused for > >> > > > iptables, routing, socket dispatch... > >> > > > Similarly we could set a "packet_id" field that uniquely identifies a > >> > > > packet so we can trace it throughout the network stack (through > >> > > > clones, encap, decap, userspace services...). > >> > > > The skb->mark, but with more room, and better support for sharing it. > >> > > > > >> > > > We can only know the layout ahead of time for the first one. And they're > >> > > > similar enough in their requirements (need to be stored somewhere in the > >> > > > SKB, have a way of retrieving each one individually, that it seems to > >> > > > make sense to use a common API). > >> > > > >> > > Why not have the following layout then? 
> >> > > > >> > > +---------------+-------------------+----------------------------------------+------+ > >> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > >> > > +---------------+-------------------+----------------------------------------+------+ > >> > > ^ ^ > >> > > data_meta data > >> > > > >> > > You obviously still have a problem of communicating the layout if you > >> > > have some redirects in between, but you, in theory still have this > >> > > problem with user-defined metadata anyway (unless I'm missing > >> > > something). > >> > > > >> > >> Hmm, I think you are missing something... As far as I'm concerned we are > >> discussing placing the KV data after the xdp_frame, and not in the XDP > >> data_meta area (as your drawing suggests). The xdp_frame is stored at > >> the very top of the headroom. Lorenzo's patchset is extending struct > >> xdp_frame and now we are discussing to we can make a more flexible API > >> for extending this. I understand that Toke confirmed this here [3]. Let > >> me know if I missed something :-) > >> > >> [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > >> > >> As part of designing this flexible API, we/Toke are trying hard not to > >> tie this to a specific data area. This is a good API design, keeping it > >> flexible enough that we can move things around should the need arise. > >> > >> I don't think it is viable to store this KV data in XDP data_meta area, > >> because existing BPF-prog's already have direct memory (write) access > >> and can change size of area, which creates too much headache with > >> (existing) BPF-progs creating unintentional breakage for the KV store, > >> which would then need extensive checks to handle random corruptions > >> (slowing down KV-store code). > > > > Yes, I'm definitely missing the bigger picture. If we want to have a global > > metadata registry in the kernel, why can't it be built on top of the existing > > area? > > Because we have no way of preventing existing XDP programs from > overwriting (corrupting) the area using the xdp_adjust_meta() API and > data_meta field. True, but this can be solved with some new BPF_F_XDP_HAS_FRAGS-like flag (which can reject loading if there is some incompatibility)? Even in the new KV-metadata world, 2+ programs still need to be aware of the new method to work correctly. But I do see your point that it's better to not apply any metadata than apply something that's corrupt/overridden. > But in a sense the *memory area* is shared between the two APIs, in the > sense that they both use the headroom before the packet data, just from > opposite ends. So if you store lots of data using the new KV API, that > space will no longer be available for xdp_adjust_{head,meta}. But the > kernel can enforce this so we don't get programs corrupting the KV > format. Ack, let's see how it shapes out. My main concern comes from the growing api surface where for af_xdp it's one mechanism, for xdp redirect it's another. And for Jakub's consumption from userspace it's gonna be another special case probably (to read it out from the headroom head)? Idk, maybe it's fine as long as each case is clearly documented.
On Mon Oct 7, 2024 at 8:48 PM CEST, Stanislav Fomichev wrote: > On 10/06, Toke Høiland-Jørgensen wrote: > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > > > > On 10/04, Jesper Dangaard Brouer wrote: > > >> > > >> > > >> On 04/10/2024 04.13, Daniel Xu wrote: > > >> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote: > > >> > > On 10/03, Arthur Fabre wrote: > > >> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote: > > >> > > > > On 10/02, Toke Høiland-Jørgensen wrote: > > >> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes: > > >> > > > > > > > >> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote: > > >> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > >> > > > > > > > > > >> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: > > >> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes: > > >> > > > > > > > > > > > > > >> [...] > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its > > >> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls > > >> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for > > >> > > > > > > > > > > > > each packet, right? > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using > > >> > > >> I really like the *compact* Key-Value (KV) store idea from Arthur. > > >> - The question is it is fast enough? > > >> > > >> I've promised Arthur to XDP micro-benchmark this, if he codes this up to > > >> be usable in the XDP code path. Listening to the LPC recording I heard > > >> that Alexei also saw potential and other use-case for this kind of > > >> fast-and-compact KV approach. > > >> > > >> I have high hopes for the performance, as Arthur uses POPCNT instruction > > >> which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1 > > >> and Reciprocal throughput 0.25. > > >> > > >> [1] https://www.agner.org/optimize/blog/read.php?i=853#848 > > >> [2] https://www.agner.org/optimize/instruction_tables.pdf > > >> > > >> [...] > > >> > > > There are two different use-cases for the metadata: > > >> > > > > > >> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a > > >> > > > few well known fields, and only XDP can access them to set them as > > >> > > > metadata, so storing them in a struct somewhere could make sense. > > >> > > > > > >> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field > > >> > > > describing which service a packet is for, and that could be reused for > > >> > > > iptables, routing, socket dispatch... > > >> > > > Similarly we could set a "packet_id" field that uniquely identifies a > > >> > > > packet so we can trace it throughout the network stack (through > > >> > > > clones, encap, decap, userspace services...). > > >> > > > The skb->mark, but with more room, and better support for sharing it. > > >> > > > > > >> > > > We can only know the layout ahead of time for the first one. And they're > > >> > > > similar enough in their requirements (need to be stored somewhere in the > > >> > > > SKB, have a way of retrieving each one individually, that it seems to > > >> > > > make sense to use a common API). > > >> > > > > >> > > Why not have the following layout then? 
> > >> > > > > >> > > +---------------+-------------------+----------------------------------------+------+ > > >> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data | > > >> > > +---------------+-------------------+----------------------------------------+------+ > > >> > > ^ ^ > > >> > > data_meta data > > >> > > > > >> > > You obviously still have a problem of communicating the layout if you > > >> > > have some redirects in between, but you, in theory still have this > > >> > > problem with user-defined metadata anyway (unless I'm missing > > >> > > something). > > >> > > > > >> > > >> Hmm, I think you are missing something... As far as I'm concerned we are > > >> discussing placing the KV data after the xdp_frame, and not in the XDP > > >> data_meta area (as your drawing suggests). The xdp_frame is stored at > > >> the very top of the headroom. Lorenzo's patchset is extending struct > > >> xdp_frame and now we are discussing to we can make a more flexible API > > >> for extending this. I understand that Toke confirmed this here [3]. Let > > >> me know if I missed something :-) > > >> > > >> [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/ > > >> > > >> As part of designing this flexible API, we/Toke are trying hard not to > > >> tie this to a specific data area. This is a good API design, keeping it > > >> flexible enough that we can move things around should the need arise. > > >> > > >> I don't think it is viable to store this KV data in XDP data_meta area, > > >> because existing BPF-prog's already have direct memory (write) access > > >> and can change size of area, which creates too much headache with > > >> (existing) BPF-progs creating unintentional breakage for the KV store, > > >> which would then need extensive checks to handle random corruptions > > >> (slowing down KV-store code). > > > > > > Yes, I'm definitely missing the bigger picture. If we want to have a global > > > metadata registry in the kernel, why can't it be built on top of the existing > > > area? > > > > Because we have no way of preventing existing XDP programs from > > overwriting (corrupting) the area using the xdp_adjust_meta() API and > > data_meta field. > > True, but this can be solved with some new BPF_F_XDP_HAS_FRAGS-like > flag (which can reject loading if there is some incompatibility)? > Even in the new KV-metadata world, 2+ programs still need to be > aware of the new method to work correctly. But I do see your point > that it's better to not apply any metadata than apply something > that's corrupt/overridden. Currently the new KV-metadata will be tied to XDP, because most NICs only reserve enough headroom if an XDP program is attached. But longer-term, I'm hoping to lift this restriction to let users not using XDP (eg using TC only, or other hook points) use the KV metadata too. Enabling it with an XDP flag would make that hard. We also want to store the new KV metadata at the start of the headroom (right after xdp_frame) so that we don't have to move it for every xdp_adjust_head() call. That makes it very easy for them to coexist, it's just a few bounds checks when we grow each one. > > But in a sense the *memory area* is shared between the two APIs, in the > > sense that they both use the headroom before the packet data, just from > > opposite ends. So if you store lots of data using the new KV API, that > > space will no longer be available for xdp_adjust_{head,meta}. 
But the > > kernel can enforce this so we don't get programs corrupting the KV > > format. > > Ack, let's see how it shapes out. My main concern comes from the > growing api surface where for af_xdp it's one mechanism, for xdp > redirect it's another. And for Jakub's consumption from userspace > it's gonna be another special case probably (to read it out from the > headroom head)? Idk, maybe it's fine as long as each case is clearly > documented. You're right, there's going to be relatively big API surface. Hopefully the APIs should all be very similar, and we'll document them.