diff mbox series

[RFCv2,bpf-next,17/18] xsk: AF_XDP xdp-hints support in desc options

Message ID 166256558657.1434226.7390735974413846384.stgit@firesoul (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series XDP-hints: XDP gaining access to HW offload hints via BTF | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 pending Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count fail Series longer than 15 patches (and no cover letter)
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 41 this patch: 41
netdev/cc_maintainers warning 9 maintainers not CCed: edumazet@google.com john.fastabend@gmail.com pabeni@redhat.com jonathan.lemon@gmail.com maciej.fijalkowski@intel.com daniel@iogearbox.net kuba@kernel.org hawk@kernel.org davem@davemloft.net
netdev/build_clang success Errors and warnings before: 9 this patch: 9
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 41 this patch: 41
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 31 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesper Dangaard Brouer Sept. 7, 2022, 3:46 p.m. UTC
From: Maryam Tahhan <mtahhan@redhat.com>

Simply set AF_XDP descriptor options to XDP flags.

Jesper: Will this really be acceptable by AF_XDP maintainers?

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
---
 include/uapi/linux/if_xdp.h |    2 +-
 net/xdp/xsk.c               |    2 +-
 net/xdp/xsk_queue.h         |    3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

Comments

Magnus Karlsson Sept. 8, 2022, 8:06 a.m. UTC | #1
On Wed, Sep 7, 2022 at 5:48 PM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> From: Maryam Tahhan <mtahhan@redhat.com>
>
> Simply set AF_XDP descriptor options to XDP flags.
>
> Jesper: Will this really be acceptable by AF_XDP maintainers?

Maryam, you guessed correctly that dedicating all these options bits
for a single feature will not be ok :-). E.g., I want one bit for the
AF_XDP multi-buffer support and who knows what other uses there might
be for this options field in the future. Let us try to solve this in
some other way. Here are some suggestions, all with their pros and
cons.

* Put this feature flag at a known place in the metadata area, for
example just before the BTF ID. No need to fill this in if you are not
redirecting to AF_XDP, but at a redirect to AF_XDP, the XDP flags are
copied into this u32 in the metadata area so that user-space can
consume it. Will cost 4 bytes of the metadata area though.

* Instead encode this information into each metadata entry in the
metadata area, in some way so that a flags field is not needed (-1
signifies not valid, or whatever happens to make sense). This has the
drawback that the user might have to look at a large number of entries
just to find out there is nothing valid to read. To alleviate this, it
could be combined with the next suggestion.

* Dedicate one bit in the options field to indicate that there is at
least one valid metadata entry in the metadata area. This could be
combined with the two approaches above. However, depending on what
metadata you have enabled, this bit might be pointless. If some
metadata is always valid, then it serves no purpose. But it might if
all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
on one packet out of one thousand.

> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
> ---
>  include/uapi/linux/if_xdp.h |    2 +-
>  net/xdp/xsk.c               |    2 +-
>  net/xdp/xsk_queue.h         |    3 ++-
>  3 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index a78a8096f4ce..9335b56474e7 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -103,7 +103,7 @@ struct xdp_options {
>  struct xdp_desc {
>         __u64 addr;
>         __u32 len;
> -       __u32 options;
> +       __u32 options; /* set to the values of xdp_hints_flags*/
>  };
>
>  /* UMEM descriptor is __u64 */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 5b4ce6ba1bc7..32095d78f06b 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -141,7 +141,7 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>         int err;
>
>         addr = xp_get_handle(xskb);
> -       err = xskq_prod_reserve_desc(xs->rx, addr, len);
> +       err = xskq_prod_reserve_desc(xs->rx, addr, len, xdp->flags);
>         if (err) {
>                 xs->rx_queue_full++;
>                 return err;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index fb20bf7207cf..7a66f082f97e 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -368,7 +368,7 @@ static inline u32 xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_d
>  }
>
>  static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
> -                                        u64 addr, u32 len)
> +                                        u64 addr, u32 len, u32 flags)
>  {
>         struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>         u32 idx;
> @@ -380,6 +380,7 @@ static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
>         idx = q->cached_prod++ & q->ring_mask;
>         ring->desc[idx].addr = addr;
>         ring->desc[idx].len = len;
> +       ring->desc[idx].options = flags;
>
>         return 0;
>  }
>
>
Maryam Tahhan Sept. 8, 2022, 10:10 a.m. UTC | #2
On 08/09/2022 09:06, Magnus Karlsson wrote:
> On Wed, Sep 7, 2022 at 5:48 PM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>
>> From: Maryam Tahhan <mtahhan@redhat.com>
>>
>> Simply set AF_XDP descriptor options to XDP flags.
>>
>> Jesper: Will this really be acceptable by AF_XDP maintainers?
> 
> Maryam, you guessed correctly that dedicating all these options bits
> for a single feature will not be ok :-). E.g., I want one bit for the
> AF_XDP multi-buffer support and who knows what other uses there might
> be for this options field in the future. Let us try to solve this in
> some other way. Here are some suggestions, all with their pros and
> cons.
> 

TBH it was Jespers question :)

> * Put this feature flag at a known place in the metadata area, for
> example just before the BTF ID. No need to fill this in if you are not
> redirecting to AF_XDP, but at a redirect to AF_XDP, the XDP flags are
> copied into this u32 in the metadata area so that user-space can
> consume it. Will cost 4 bytes of the metadata area though.

If Jesper agrees I think this approach would make sense. Trying to
translate encodings into some other flags for AF_XDP I think will lead
to a growing set of translations as more options come along.
The other thing to be aware of is just making sure to clear/zero the 
metadata space in the buffers at some point (ideally when the descriptor 
is returned from the application) so when the buffers are used again
they are already in a "reset" state.

> 
> * Instead encode this information into each metadata entry in the
> metadata area, in some way so that a flags field is not needed (-1
> signifies not valid, or whatever happens to make sense). This has the
> drawback that the user might have to look at a large number of entries
> just to find out there is nothing valid to read. To alleviate this, it
> could be combined with the next suggestion.
> 
> * Dedicate one bit in the options field to indicate that there is at
> least one valid metadata entry in the metadata area. This could be
> combined with the two approaches above. However, depending on what
> metadata you have enabled, this bit might be pointless. If some
> metadata is always valid, then it serves no purpose. But it might if
> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
> on one packet out of one thousand.
> 
>> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
>> ---
>>   include/uapi/linux/if_xdp.h |    2 +-
>>   net/xdp/xsk.c               |    2 +-
>>   net/xdp/xsk_queue.h         |    3 ++-
>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index a78a8096f4ce..9335b56474e7 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -103,7 +103,7 @@ struct xdp_options {
>>   struct xdp_desc {
>>          __u64 addr;
>>          __u32 len;
>> -       __u32 options;
>> +       __u32 options; /* set to the values of xdp_hints_flags*/
>>   };
>>
>>   /* UMEM descriptor is __u64 */
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 5b4ce6ba1bc7..32095d78f06b 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -141,7 +141,7 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>          int err;
>>
>>          addr = xp_get_handle(xskb);
>> -       err = xskq_prod_reserve_desc(xs->rx, addr, len);
>> +       err = xskq_prod_reserve_desc(xs->rx, addr, len, xdp->flags);
>>          if (err) {
>>                  xs->rx_queue_full++;
>>                  return err;
>> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
>> index fb20bf7207cf..7a66f082f97e 100644
>> --- a/net/xdp/xsk_queue.h
>> +++ b/net/xdp/xsk_queue.h
>> @@ -368,7 +368,7 @@ static inline u32 xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_d
>>   }
>>
>>   static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
>> -                                        u64 addr, u32 len)
>> +                                        u64 addr, u32 len, u32 flags)
>>   {
>>          struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>>          u32 idx;
>> @@ -380,6 +380,7 @@ static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
>>          idx = q->cached_prod++ & q->ring_mask;
>>          ring->desc[idx].addr = addr;
>>          ring->desc[idx].len = len;
>> +       ring->desc[idx].options = flags;
>>
>>          return 0;
>>   }
>>
>>
>
Jesper Dangaard Brouer Sept. 8, 2022, 3:04 p.m. UTC | #3
On 08/09/2022 12.10, Maryam Tahhan wrote:
> On 08/09/2022 09:06, Magnus Karlsson wrote:
>> On Wed, Sep 7, 2022 at 5:48 PM Jesper Dangaard Brouer 
>> <brouer@redhat.com> wrote:
>>>
>>> From: Maryam Tahhan <mtahhan@redhat.com>
>>>
>>> Simply set AF_XDP descriptor options to XDP flags.
>>>
>>> Jesper: Will this really be acceptable by AF_XDP maintainers?
>>
>> Maryam, you guessed correctly that dedicating all these options bits
>> for a single feature will not be ok :-). E.g., I want one bit for the
>> AF_XDP multi-buffer support and who knows what other uses there might
>> be for this options field in the future. Let us try to solve this in
>> some other way. Here are some suggestions, all with their pros and
>> cons.
>>
> 
> TBH it was Jespers question :)

True. I'm generally questioning this patch...
... and indirectly asking Magnus.  (If you noticed, I didn't add my SoB)

>> * Put this feature flag at a known place in the metadata area, for
>> example just before the BTF ID. No need to fill this in if you are not
>> redirecting to AF_XDP, but at a redirect to AF_XDP, the XDP flags are
>> copied into this u32 in the metadata area so that user-space can
>> consume it. Will cost 4 bytes of the metadata area though.
> 
> If Jesper agrees I think this approach would make sense. Trying to
> translate encodings into some other flags for AF_XDP I think will lead
> to a growing set of translations as more options come along.
> The other thing to be aware of is just making sure to clear/zero the 
> metadata space in the buffers at some point (ideally when the descriptor 
> is returned from the application) so when the buffers are used again
> they are already in a "reset" state.

I don't like this option ;-)

First of all because this can give false positives, if "XDP flags copied
into metadata area" is used for something else.  This can easily happen
as XDP BPF-progs are free to metadata for something else.

Second reason, because it would require AF_XDP to always read the
metadata cache-line (and write, if clearing on "return").  Not a good
optioon, given how performance sensitive AF_XDP workloads (at least
benchmarks).

>>
>> * Instead encode this information into each metadata entry in the
>> metadata area, in some way so that a flags field is not needed (-1
>> signifies not valid, or whatever happens to make sense). This has the
>> drawback that the user might have to look at a large number of entries
>> just to find out there is nothing valid to read. To alleviate this, it
>> could be combined with the next suggestion.
>>
>> * Dedicate one bit in the options field to indicate that there is at
>> least one valid metadata entry in the metadata area. This could be
>> combined with the two approaches above. However, depending on what
>> metadata you have enabled, this bit might be pointless. If some
>> metadata is always valid, then it serves no purpose. But it might if
>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
>> on one packet out of one thousand.
>>

I like this option better! Except that I have hoped to get 2 bits ;-)

The performance advantage is that the AF_XDP descriptor bits will 
already be cache-hot, and if it indicates no-metadata-hints the AF_XDP 
application can avoid reading the metadata cache-line :-).

When metadata is valid and contains valid XDP-hints can change between 
two packets.  E.g. XDP-hints can be enabled/disabled via ethtool, and 
the content can be enabled/disabled by other ethtool commands, and even 
setsockopt calls (e.g timestamping).  An XDP prog can also choose to use 
the area for something else for a subset of the packets.

It is a design choice in this patchset to avoid locking down the NIC 
driver to a fixed XDP-hints layout, and avoid locking/disabling other 
ethtool config setting to keeping XDP-hints layout stable.  Originally I 
wanted this, but I realized that it would be impossible (and annoying 
for users) if we had to control every config interface to NIC hardware 
offload hints, to keep XDP-hints "always-valid".

--Jesper

>>> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
>>> ---
>>>   include/uapi/linux/if_xdp.h |    2 +-
>>>   net/xdp/xsk.c               |    2 +-
>>>   net/xdp/xsk_queue.h         |    3 ++-
>>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>> index a78a8096f4ce..9335b56474e7 100644
>>> --- a/include/uapi/linux/if_xdp.h
>>> +++ b/include/uapi/linux/if_xdp.h
>>> @@ -103,7 +103,7 @@ struct xdp_options {
>>>   struct xdp_desc {
>>>          __u64 addr;
>>>          __u32 len;
>>> -       __u32 options;
>>> +       __u32 options; /* set to the values of xdp_hints_flags*/
>>>   };
>>>
>>>   /* UMEM descriptor is __u64 */
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 5b4ce6ba1bc7..32095d78f06b 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -141,7 +141,7 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, 
>>> struct xdp_buff *xdp, u32 len)
>>>          int err;
>>>
>>>          addr = xp_get_handle(xskb);
>>> -       err = xskq_prod_reserve_desc(xs->rx, addr, len);
>>> +       err = xskq_prod_reserve_desc(xs->rx, addr, len, xdp->flags);
>>>          if (err) {
>>>                  xs->rx_queue_full++;
>>>                  return err;
>>> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
>>> index fb20bf7207cf..7a66f082f97e 100644
>>> --- a/net/xdp/xsk_queue.h
>>> +++ b/net/xdp/xsk_queue.h
>>> @@ -368,7 +368,7 @@ static inline u32 
>>> xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_d
>>>   }
>>>
>>>   static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
>>> -                                        u64 addr, u32 len)
>>> +                                        u64 addr, u32 len, u32 flags)
>>>   {
>>>          struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>>>          u32 idx;
>>> @@ -380,6 +380,7 @@ static inline int xskq_prod_reserve_desc(struct 
>>> xsk_queue *q,
>>>          idx = q->cached_prod++ & q->ring_mask;
>>>          ring->desc[idx].addr = addr;
>>>          ring->desc[idx].len = len;
>>> +       ring->desc[idx].options = flags;
>>>
>>>          return 0;
>>>   }
>>>
>>>
>>
>
Magnus Karlsson Sept. 9, 2022, 6:43 a.m. UTC | #4
On Thu, Sep 8, 2022 at 5:04 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 08/09/2022 12.10, Maryam Tahhan wrote:
> > On 08/09/2022 09:06, Magnus Karlsson wrote:
> >> On Wed, Sep 7, 2022 at 5:48 PM Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:
> >>>
> >>> From: Maryam Tahhan <mtahhan@redhat.com>
> >>>
> >>> Simply set AF_XDP descriptor options to XDP flags.
> >>>
> >>> Jesper: Will this really be acceptable by AF_XDP maintainers?
> >>
> >> Maryam, you guessed correctly that dedicating all these options bits
> >> for a single feature will not be ok :-). E.g., I want one bit for the
> >> AF_XDP multi-buffer support and who knows what other uses there might
> >> be for this options field in the future. Let us try to solve this in
> >> some other way. Here are some suggestions, all with their pros and
> >> cons.
> >>
> >
> > TBH it was Jespers question :)
>
> True. I'm generally questioning this patch...
> ... and indirectly asking Magnus.  (If you noticed, I didn't add my SoB)
>
> >> * Put this feature flag at a known place in the metadata area, for
> >> example just before the BTF ID. No need to fill this in if you are not
> >> redirecting to AF_XDP, but at a redirect to AF_XDP, the XDP flags are
> >> copied into this u32 in the metadata area so that user-space can
> >> consume it. Will cost 4 bytes of the metadata area though.
> >
> > If Jesper agrees I think this approach would make sense. Trying to
> > translate encodings into some other flags for AF_XDP I think will lead
> > to a growing set of translations as more options come along.
> > The other thing to be aware of is just making sure to clear/zero the
> > metadata space in the buffers at some point (ideally when the descriptor
> > is returned from the application) so when the buffers are used again
> > they are already in a "reset" state.
>
> I don't like this option ;-)
>
> First of all because this can give false positives, if "XDP flags copied
> into metadata area" is used for something else.  This can easily happen
> as XDP BPF-progs are free to metadata for something else.

Are XDP programs not free to overwrite the BTF id that you have last
in the md section too and you can get false positives for that as
well? Or do you protect it in some way? Sorry, but I do not understand
why a flags field would be different from a BTF id stored in the
metadata section.

> Second reason, because it would require AF_XDP to always read the
> metadata cache-line (and write, if clearing on "return").  Not a good
> optioon, given how performance sensitive AF_XDP workloads (at least
> benchmarks).

On its own, you are right, but when combined with the "bit in the
descriptor" proposal below, you would not get this performance
penalty. If the bit is zero, you do not have to read the MD cache
line. If the bit is one, you want to read the MD line to get your
metadata anyway, so one more read on the same cache line to get the
flags would not hurt performance. (There is of course a case where the
4 extra bytes of the flags could push the metadata you are interested
in to a new cache line, but this should be rare.)

But it all depends on if you need the resolution of a u32 flags field.
If not, forget this idea. If you do, then the metadata section is the
only place for it.

> >>
> >> * Instead encode this information into each metadata entry in the
> >> metadata area, in some way so that a flags field is not needed (-1
> >> signifies not valid, or whatever happens to make sense). This has the
> >> drawback that the user might have to look at a large number of entries
> >> just to find out there is nothing valid to read. To alleviate this, it
> >> could be combined with the next suggestion.
> >>
> >> * Dedicate one bit in the options field to indicate that there is at
> >> least one valid metadata entry in the metadata area. This could be
> >> combined with the two approaches above. However, depending on what
> >> metadata you have enabled, this bit might be pointless. If some
> >> metadata is always valid, then it serves no purpose. But it might if
> >> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
> >> on one packet out of one thousand.
> >>
>
> I like this option better! Except that I have hoped to get 2 bits ;-)

I will give you two if you need it Jesper, no problem :-).

> The performance advantage is that the AF_XDP descriptor bits will
> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
> application can avoid reading the metadata cache-line :-).

Agreed. I prefer if we can keep it simple and fast like this.

> When metadata is valid and contains valid XDP-hints can change between
> two packets.  E.g. XDP-hints can be enabled/disabled via ethtool, and
> the content can be enabled/disabled by other ethtool commands, and even
> setsockopt calls (e.g timestamping).  An XDP prog can also choose to use
> the area for something else for a subset of the packets.
>
> It is a design choice in this patchset to avoid locking down the NIC
> driver to a fixed XDP-hints layout, and avoid locking/disabling other
> ethtool config setting to keeping XDP-hints layout stable.  Originally I
> wanted this, but I realized that it would be impossible (and annoying
> for users) if we had to control every config interface to NIC hardware
> offload hints, to keep XDP-hints "always-valid".

> --Jesper
>
> >>> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
> >>> ---
> >>>   include/uapi/linux/if_xdp.h |    2 +-
> >>>   net/xdp/xsk.c               |    2 +-
> >>>   net/xdp/xsk_queue.h         |    3 ++-
> >>>   3 files changed, 4 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> >>> index a78a8096f4ce..9335b56474e7 100644
> >>> --- a/include/uapi/linux/if_xdp.h
> >>> +++ b/include/uapi/linux/if_xdp.h
> >>> @@ -103,7 +103,7 @@ struct xdp_options {
> >>>   struct xdp_desc {
> >>>          __u64 addr;
> >>>          __u32 len;
> >>> -       __u32 options;
> >>> +       __u32 options; /* set to the values of xdp_hints_flags*/
> >>>   };
> >>>
> >>>   /* UMEM descriptor is __u64 */
> >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>> index 5b4ce6ba1bc7..32095d78f06b 100644
> >>> --- a/net/xdp/xsk.c
> >>> +++ b/net/xdp/xsk.c
> >>> @@ -141,7 +141,7 @@ static int __xsk_rcv_zc(struct xdp_sock *xs,
> >>> struct xdp_buff *xdp, u32 len)
> >>>          int err;
> >>>
> >>>          addr = xp_get_handle(xskb);
> >>> -       err = xskq_prod_reserve_desc(xs->rx, addr, len);
> >>> +       err = xskq_prod_reserve_desc(xs->rx, addr, len, xdp->flags);
> >>>          if (err) {
> >>>                  xs->rx_queue_full++;
> >>>                  return err;
> >>> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> >>> index fb20bf7207cf..7a66f082f97e 100644
> >>> --- a/net/xdp/xsk_queue.h
> >>> +++ b/net/xdp/xsk_queue.h
> >>> @@ -368,7 +368,7 @@ static inline u32
> >>> xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_d
> >>>   }
> >>>
> >>>   static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
> >>> -                                        u64 addr, u32 len)
> >>> +                                        u64 addr, u32 len, u32 flags)
> >>>   {
> >>>          struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
> >>>          u32 idx;
> >>> @@ -380,6 +380,7 @@ static inline int xskq_prod_reserve_desc(struct
> >>> xsk_queue *q,
> >>>          idx = q->cached_prod++ & q->ring_mask;
> >>>          ring->desc[idx].addr = addr;
> >>>          ring->desc[idx].len = len;
> >>> +       ring->desc[idx].options = flags;
> >>>
> >>>          return 0;
> >>>   }
> >>>
> >>>
> >>
> >
>
Maryam Tahhan Sept. 9, 2022, 8:12 a.m. UTC | #5
<snip>
>>>>
>>>> * Instead encode this information into each metadata entry in the
>>>> metadata area, in some way so that a flags field is not needed (-1
>>>> signifies not valid, or whatever happens to make sense). This has the
>>>> drawback that the user might have to look at a large number of entries
>>>> just to find out there is nothing valid to read. To alleviate this, it
>>>> could be combined with the next suggestion.
>>>>
>>>> * Dedicate one bit in the options field to indicate that there is at
>>>> least one valid metadata entry in the metadata area. This could be
>>>> combined with the two approaches above. However, depending on what
>>>> metadata you have enabled, this bit might be pointless. If some
>>>> metadata is always valid, then it serves no purpose. But it might if
>>>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
>>>> on one packet out of one thousand.
>>>>
>>
>> I like this option better! Except that I have hoped to get 2 bits ;-)
> 
> I will give you two if you need it Jesper, no problem :-).
> 

Ok I will look at implementing and testing this and post an update.

Thanks folks

>> The performance advantage is that the AF_XDP descriptor bits will
>> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
>> application can avoid reading the metadata cache-line :-).
> 
> Agreed. I prefer if we can keep it simple and fast like this.
> 
<snip>
Jesper Dangaard Brouer Sept. 9, 2022, 9:42 a.m. UTC | #6
On 09/09/2022 10.12, Maryam Tahhan wrote:
> <snip>
>>>>>
>>>>> * Instead encode this information into each metadata entry in the
>>>>> metadata area, in some way so that a flags field is not needed (-1
>>>>> signifies not valid, or whatever happens to make sense). This has the
>>>>> drawback that the user might have to look at a large number of entries
>>>>> just to find out there is nothing valid to read. To alleviate this, it
>>>>> could be combined with the next suggestion.
>>>>>
>>>>> * Dedicate one bit in the options field to indicate that there is at
>>>>> least one valid metadata entry in the metadata area. This could be
>>>>> combined with the two approaches above. However, depending on what
>>>>> metadata you have enabled, this bit might be pointless. If some
>>>>> metadata is always valid, then it serves no purpose. But it might if
>>>>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
>>>>> on one packet out of one thousand.
>>>>>
>>>
>>> I like this option better! Except that I have hoped to get 2 bits ;-)
>>
>> I will give you two if you need it Jesper, no problem :-).
>>
> 
> Ok I will look at implementing and testing this and post an update.

Perfect if you Maryam have cycles to work on this.

Let me explain what I wanted the 2nd bit for.  I simply wanted to also
transfer the XDP_FLAGS_HINTS_COMPAT_COMMON flag.  One could argue that
is it redundant information as userspace AF_XDP will have to BTF decode
all the know XDP-hints. Thus, it could know if a BTF type ID is
compatible with the common struct.   This problem is performance as my
userspace AF_XDP code will have to do more code (switch/jump-table or
table lookup) to map IDs to common compat (to e.g. extract the RX-csum
indication).  Getting this extra "common-compat" bit is actually a
micro-optimization.  It is up to AF_XDP maintainers if they can spare
this bit.


> Thanks folks
> 
>>> The performance advantage is that the AF_XDP descriptor bits will
>>> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
>>> application can avoid reading the metadata cache-line :-).
>>
>> Agreed. I prefer if we can keep it simple and fast like this.
>>

Great, lets proceed this way then.

> <snip>
> 

Thinking ahead: We will likely need 3 bits.

The idea is that for TX-side, we set a bit indicating that AF_XDP have
provided a valid XDP-hints layout (incl corresponding BTF ID). (I would
overload and reuse "common-compat" bit if TX gets a common struct).

But lets land RX-side first, but make sure we can easily extend for the 
TX-side.

--Jesper
Magnus Karlsson Sept. 9, 2022, 10:14 a.m. UTC | #7
On Fri, Sep 9, 2022 at 11:42 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 09/09/2022 10.12, Maryam Tahhan wrote:
> > <snip>
> >>>>>
> >>>>> * Instead encode this information into each metadata entry in the
> >>>>> metadata area, in some way so that a flags field is not needed (-1
> >>>>> signifies not valid, or whatever happens to make sense). This has the
> >>>>> drawback that the user might have to look at a large number of entries
> >>>>> just to find out there is nothing valid to read. To alleviate this, it
> >>>>> could be combined with the next suggestion.
> >>>>>
> >>>>> * Dedicate one bit in the options field to indicate that there is at
> >>>>> least one valid metadata entry in the metadata area. This could be
> >>>>> combined with the two approaches above. However, depending on what
> >>>>> metadata you have enabled, this bit might be pointless. If some
> >>>>> metadata is always valid, then it serves no purpose. But it might if
> >>>>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
> >>>>> on one packet out of one thousand.
> >>>>>
> >>>
> >>> I like this option better! Except that I have hoped to get 2 bits ;-)
> >>
> >> I will give you two if you need it Jesper, no problem :-).
> >>
> >
> > Ok I will look at implementing and testing this and post an update.
>
> Perfect if you Maryam have cycles to work on this.
>
> Let me explain what I wanted the 2nd bit for.  I simply wanted to also
> transfer the XDP_FLAGS_HINTS_COMPAT_COMMON flag.  One could argue that
> is it redundant information as userspace AF_XDP will have to BTF decode
> all the know XDP-hints. Thus, it could know if a BTF type ID is
> compatible with the common struct.   This problem is performance as my
> userspace AF_XDP code will have to do more code (switch/jump-table or
> table lookup) to map IDs to common compat (to e.g. extract the RX-csum
> indication).  Getting this extra "common-compat" bit is actually a
> micro-optimization.  It is up to AF_XDP maintainers if they can spare
> this bit.
>
>
> > Thanks folks
> >
> >>> The performance advantage is that the AF_XDP descriptor bits will
> >>> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
> >>> application can avoid reading the metadata cache-line :-).
> >>
> >> Agreed. I prefer if we can keep it simple and fast like this.
> >>
>
> Great, lets proceed this way then.
>
> > <snip>
> >
>
> Thinking ahead: We will likely need 3 bits.
>
> The idea is that for TX-side, we set a bit indicating that AF_XDP have
> provided a valid XDP-hints layout (incl corresponding BTF ID). (I would
> overload and reuse "common-compat" bit if TX gets a common struct).

I think we should reuse the "Rx metadata valid" flag for this since
this will not be used in the Tx case by definition. In the Tx case,
this bit would instead mean that the user has provided a valid
XDP-hints layout. It has a nice symmetry, on Rx it is set by the
kernel when it has put something relevant in the metadata area. On Tx,
it is set by user-space if it has put something relevant in the
metadata area. We can also reuse this bit when we get a notification
in the completion queue to indicate if the kernel has produced some
metadata on tx completions. This could be a Tx timestamp for example.

So hopefully we could live with only two bits :-).

> But lets land RX-side first, but make sure we can easily extend for the
> TX-side.
>
> --Jesper
>
Jesper Dangaard Brouer Sept. 9, 2022, 12:35 p.m. UTC | #8
On 09/09/2022 12.14, Magnus Karlsson wrote:
> On Fri, Sep 9, 2022 at 11:42 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 09/09/2022 10.12, Maryam Tahhan wrote:
>>> <snip>
>>>>>>>
>>>>>>> * Instead encode this information into each metadata entry in the
>>>>>>> metadata area, in some way so that a flags field is not needed (-1
>>>>>>> signifies not valid, or whatever happens to make sense). This has the
>>>>>>> drawback that the user might have to look at a large number of entries
>>>>>>> just to find out there is nothing valid to read. To alleviate this, it
>>>>>>> could be combined with the next suggestion.
>>>>>>>
>>>>>>> * Dedicate one bit in the options field to indicate that there is at
>>>>>>> least one valid metadata entry in the metadata area. This could be
>>>>>>> combined with the two approaches above. However, depending on what
>>>>>>> metadata you have enabled, this bit might be pointless. If some
>>>>>>> metadata is always valid, then it serves no purpose. But it might if
>>>>>>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
>>>>>>> on one packet out of one thousand.
>>>>>>>
>>>>>
>>>>> I like this option better! Except that I have hoped to get 2 bits ;-)
>>>>
>>>> I will give you two if you need it Jesper, no problem :-).
>>>>
>>>
>>> Ok I will look at implementing and testing this and post an update.
>>
>> Perfect if you Maryam have cycles to work on this.
>>
>> Let me explain what I wanted the 2nd bit for.  I simply wanted to also
>> transfer the XDP_FLAGS_HINTS_COMPAT_COMMON flag.  One could argue that
>> is it redundant information as userspace AF_XDP will have to BTF decode
>> all the know XDP-hints. Thus, it could know if a BTF type ID is
>> compatible with the common struct.   This problem is performance as my
>> userspace AF_XDP code will have to do more code (switch/jump-table or
>> table lookup) to map IDs to common compat (to e.g. extract the RX-csum
>> indication).  Getting this extra "common-compat" bit is actually a
>> micro-optimization.  It is up to AF_XDP maintainers if they can spare
>> this bit.
>>
>>
>>> Thanks folks
>>>
>>>>> The performance advantage is that the AF_XDP descriptor bits will
>>>>> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
>>>>> application can avoid reading the metadata cache-line :-).
>>>>
>>>> Agreed. I prefer if we can keep it simple and fast like this.
>>>>
>>
>> Great, lets proceed this way then.
>>
>>> <snip>
>>>
>>
>> Thinking ahead: We will likely need 3 bits.
>>
>> The idea is that for TX-side, we set a bit indicating that AF_XDP have
>> provided a valid XDP-hints layout (incl corresponding BTF ID). (I would
>> overload and reuse "common-compat" bit if TX gets a common struct).
> 
> I think we should reuse the "Rx metadata valid" flag for this since
> this will not be used in the Tx case by definition. In the Tx case,
> this bit would instead mean that the user has provided a valid
> XDP-hints layout. It has a nice symmetry, on Rx it is set by the
> kernel when it has put something relevant in the metadata area. On Tx,
> it is set by user-space if it has put something relevant in the
> metadata area. 

I generally like reusing the bit, *BUT* there is the problem of 
(existing) applications ignoring the desc-options bit and forwarding 
packets.  This would cause the "Rx metadata valid" flag to be seen as 
userspace having set the "TX-hints-bit" and kernel would use what is 
provided in metadata area (leftovers from RX-hints).  IMHO that will be 
hard to debug for end-users and likely break existing applications.

> We can also reuse this bit when we get a notification
> in the completion queue to indicate if the kernel has produced some
> metadata on tx completions. This could be a Tx timestamp for example.
> 

Big YES, reuse "Rx metadata valid" bit when we get a TX notification in 
completion queue.  This will be okay because it cannot be forgotten and 
misinterpreted as the kernel will have responsibility to update this bit.

> So hopefully we could live with only two bits :-).
> 

I still think we need three bits ;-)
That should be enough to cover the 6 states:
  - RX hints
  - RX hints and compat
  - TX hints
  - TX hints and compat
  - TX completion
  - TX completion and compat


>> But lets land RX-side first, but make sure we can easily extend for the
>> TX-side.
Magnus Karlsson Sept. 9, 2022, 12:44 p.m. UTC | #9
On Fri, Sep 9, 2022 at 2:35 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 09/09/2022 12.14, Magnus Karlsson wrote:
> > On Fri, Sep 9, 2022 at 11:42 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >> On 09/09/2022 10.12, Maryam Tahhan wrote:
> >>> <snip>
> >>>>>>>
> >>>>>>> * Instead encode this information into each metadata entry in the
> >>>>>>> metadata area, in some way so that a flags field is not needed (-1
> >>>>>>> signifies not valid, or whatever happens to make sense). This has the
> >>>>>>> drawback that the user might have to look at a large number of entries
> >>>>>>> just to find out there is nothing valid to read. To alleviate this, it
> >>>>>>> could be combined with the next suggestion.
> >>>>>>>
> >>>>>>> * Dedicate one bit in the options field to indicate that there is at
> >>>>>>> least one valid metadata entry in the metadata area. This could be
> >>>>>>> combined with the two approaches above. However, depending on what
> >>>>>>> metadata you have enabled, this bit might be pointless. If some
> >>>>>>> metadata is always valid, then it serves no purpose. But it might if
> >>>>>>> all enabled metadata is rarely valid, e.g., if you get an Rx timestamp
> >>>>>>> on one packet out of one thousand.
> >>>>>>>
> >>>>>
> >>>>> I like this option better! Except that I have hoped to get 2 bits ;-)
> >>>>
> >>>> I will give you two if you need it Jesper, no problem :-).
> >>>>
> >>>
> >>> Ok I will look at implementing and testing this and post an update.
> >>
> >> Perfect if you Maryam have cycles to work on this.
> >>
> >> Let me explain what I wanted the 2nd bit for.  I simply wanted to also
> >> transfer the XDP_FLAGS_HINTS_COMPAT_COMMON flag.  One could argue that
> >> is it redundant information as userspace AF_XDP will have to BTF decode
> >> all the know XDP-hints. Thus, it could know if a BTF type ID is
> >> compatible with the common struct.   This problem is performance as my
> >> userspace AF_XDP code will have to do more code (switch/jump-table or
> >> table lookup) to map IDs to common compat (to e.g. extract the RX-csum
> >> indication).  Getting this extra "common-compat" bit is actually a
> >> micro-optimization.  It is up to AF_XDP maintainers if they can spare
> >> this bit.
> >>
> >>
> >>> Thanks folks
> >>>
> >>>>> The performance advantage is that the AF_XDP descriptor bits will
> >>>>> already be cache-hot, and if it indicates no-metadata-hints the AF_XDP
> >>>>> application can avoid reading the metadata cache-line :-).
> >>>>
> >>>> Agreed. I prefer if we can keep it simple and fast like this.
> >>>>
> >>
> >> Great, lets proceed this way then.
> >>
> >>> <snip>
> >>>
> >>
> >> Thinking ahead: We will likely need 3 bits.
> >>
> >> The idea is that for TX-side, we set a bit indicating that AF_XDP have
> >> provided a valid XDP-hints layout (incl corresponding BTF ID). (I would
> >> overload and reuse "common-compat" bit if TX gets a common struct).
> >
> > I think we should reuse the "Rx metadata valid" flag for this since
> > this will not be used in the Tx case by definition. In the Tx case,
> > this bit would instead mean that the user has provided a valid
> > XDP-hints layout. It has a nice symmetry, on Rx it is set by the
> > kernel when it has put something relevant in the metadata area. On Tx,
> > it is set by user-space if it has put something relevant in the
> > metadata area.
>
> I generally like reusing the bit, *BUT* there is the problem of
> (existing) applications ignoring the desc-options bit and forwarding
> packets.  This would cause the "Rx metadata valid" flag to be seen as
> userspace having set the "TX-hints-bit" and kernel would use what is
> provided in metadata area (leftovers from RX-hints).  IMHO that will be
> hard to debug for end-users and likely break existing applications.

Good point. I buy this. We need separate Rx and Tx bits.

> > We can also reuse this bit when we get a notification
> > in the completion queue to indicate if the kernel has produced some
> > metadata on tx completions. This could be a Tx timestamp for example.
> >
>
> Big YES, reuse "Rx metadata valid" bit when we get a TX notification in
> completion queue.  This will be okay because it cannot be forgotten and
> misinterpreted as the kernel will have responsibility to update this bit.
>
> > So hopefully we could live with only two bits :-).
> >
>
> I still think we need three bits ;-)
> That should be enough to cover the 6 states:
>   - RX hints
>   - RX hints and compat
>   - TX hints
>   - TX hints and compat
>   - TX completion
>   - TX completion and compat
>
>
> >> But lets land RX-side first, but make sure we can easily extend for the
> >> TX-side.
>
diff mbox series

Patch

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a8096f4ce..9335b56474e7 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -103,7 +103,7 @@  struct xdp_options {
 struct xdp_desc {
 	__u64 addr;
 	__u32 len;
-	__u32 options;
+	__u32 options; /* set to the values of xdp_hints_flags*/
 };
 
 /* UMEM descriptor is __u64 */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5b4ce6ba1bc7..32095d78f06b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -141,7 +141,7 @@  static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 	int err;
 
 	addr = xp_get_handle(xskb);
-	err = xskq_prod_reserve_desc(xs->rx, addr, len);
+	err = xskq_prod_reserve_desc(xs->rx, addr, len, xdp->flags);
 	if (err) {
 		xs->rx_queue_full++;
 		return err;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index fb20bf7207cf..7a66f082f97e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -368,7 +368,7 @@  static inline u32 xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_d
 }
 
 static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
-					 u64 addr, u32 len)
+					 u64 addr, u32 len, u32 flags)
 {
 	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 	u32 idx;
@@ -380,6 +380,7 @@  static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
 	idx = q->cached_prod++ & q->ring_mask;
 	ring->desc[idx].addr = addr;
 	ring->desc[idx].len = len;
+	ring->desc[idx].options = flags;
 
 	return 0;
 }