diff mbox series

[RFC,bpf-next,5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver

Message ID 20221027200019.4106375-6-sdf@google.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series xdp: hints via kfuncs | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 pending Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 27 this patch: 27
netdev/cc_maintainers warning 9 maintainers not CCed: jonathan.lemon@gmail.com maciej.fijalkowski@intel.com davem@davemloft.net magnus.karlsson@intel.com hawk@kernel.org linux-kselftest@vger.kernel.org bjorn@kernel.org shuah@kernel.org mykolal@fb.com
netdev/build_clang fail Errors and warnings before: 16 this patch: 16
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 27 this patch: 27
netdev/checkpatch warning WARNING: line length of 100 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Stanislav Fomichev Oct. 27, 2022, 8 p.m. UTC
Example on how the metadata is prepared from the BPF context
and consumed by AF_XDP:

- bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
  if not, I'm assuming verifier will remove this "if (0)" branch
- bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
  the program has to bpf_xdp_adjust_meta+memcpy it;
  maybe returning a pointer is better?
- af_xdp consumer grabs it from data-<expected_metadata_offset> and
  makes sure timestamp is not empty
- when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
 tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

Comments

Martin KaFai Lau Oct. 28, 2022, 6:22 a.m. UTC | #1
On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> Example on how the metadata is prepared from the BPF context
> and consumed by AF_XDP:
> 
> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>    if not, I'm assuming verifier will remove this "if (0)" branch
> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>    the program has to bpf_xdp_adjust_meta+memcpy it;
>    maybe returning a pointer is better?
> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>    makes sure timestamp is not empty
> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> 
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>   2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c b/tools/testing/selftests/bpf/progs/xskxceiver.c
> index b135daddad3a..83c879aa3581 100644
> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
> @@ -12,9 +12,31 @@ struct {
>   	__type(value, __u32);
>   } xsk SEC(".maps");
>   
> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) __ksym;
> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> +
>   SEC("xdp")
>   int rx(struct xdp_md *ctx)
>   {
> +	void *data, *data_meta;
> +	__u32 rx_timestamp;
> +	int ret;
> +
> +	if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
> +		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
> +		if (ret != 0)
> +			return XDP_DROP;
> +
> +		data = (void *)(long)ctx->data;
> +		data_meta = (void *)(long)ctx->data_meta;
> +
> +		if (data_meta + sizeof(__u32) > data)
> +			return XDP_DROP;
> +
> +		rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> +		__builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> +	}

Thanks for the patches.  I took a quick look at patch 1 and 2 but haven't had a 
chance to look at the implementation details (eg. KF_UNROLL...etc), yet.

Overall (with the example here) looks promising.  There is a lot of flexibility 
on whether the xdp prog needs any hint at all, which hint it needs, and how to 
store it.

> +
>   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>   }
>   
> diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
> index 066bd691db13..ce82c89a432e 100644
> --- a/tools/testing/selftests/bpf/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info *umem, struct pkt_stream *pkt
>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
>   {
>   	void *data = xsk_umem__get_data(buffer, addr);
> +	void *data_meta = data - sizeof(__u32);
>   	struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct ethhdr));
> +	__u32 rx_timestamp = 0;
>   
>   	if (!pkt) {
>   		ksft_print_msg("[%s] too many packets received\n", __func__);
> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
>   		return false;
>   	}
>   
> +	memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
> +	if (rx_timestamp == 0) {
> +		ksft_print_msg("Invalid metadata received: ");
> +		ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
> +		return false;
> +	}
> +
>   	return true;
>   }

>
Jesper Dangaard Brouer Oct. 28, 2022, 10:37 a.m. UTC | #2
On 28/10/2022 08.22, Martin KaFai Lau wrote:
> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
>> Example on how the metadata is prepared from the BPF context
>> and consumed by AF_XDP:
>>
>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>>    if not, I'm assuming verifier will remove this "if (0)" branch
>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>>    the program has to bpf_xdp_adjust_meta+memcpy it;
>>    maybe returning a pointer is better?
>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>>    makes sure timestamp is not empty
>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
>>
>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Willem de Bruijn <willemb@google.com>
>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>> Cc: xdp-hints@xdp-project.net
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> ---
>>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>>   2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c 
>> b/tools/testing/selftests/bpf/progs/xskxceiver.c
>> index b135daddad3a..83c879aa3581 100644
>> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
>> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
>> @@ -12,9 +12,31 @@ struct {
>>       __type(value, __u32);
>>   } xsk SEC(".maps");
>> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) 
>> __ksym;
>> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
>> +
>>   SEC("xdp")
>>   int rx(struct xdp_md *ctx)
>>   {
>> +    void *data, *data_meta;
>> +    __u32 rx_timestamp;
>> +    int ret;
>> +
>> +    if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {

In current veth implementation, bpf_xdp_metadata_have_rx_timestamp()
will always return true here.

In the case of hardware timestamps, not every packet will contain a 
hardware timestamp.  (See my/Maryam ixgbe patch, where timestamps are 
read via HW device register, which isn't fast, and HW only support this 
for timesync protocols like PTP).

How do you imagine we can extend this?

>> +        ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));

IMHO sizeof() should come from a struct describing data_meta area see:
 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62


>> +        if (ret != 0)
>> +            return XDP_DROP;
>> +
>> +        data = (void *)(long)ctx->data;
>> +        data_meta = (void *)(long)ctx->data_meta;
>> +
>> +        if (data_meta + sizeof(__u32) > data)
>> +            return XDP_DROP;
>> +
>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));

So, this approach first stores hints on some other memory location, and 
then need to copy over information into data_meta area. That isn't good 
from a performance perspective.

My idea is to store it in the final data_meta destination immediately.

Do notice that in my approach, the existing ethtool config setting and 
socket options (for timestamps) still apply.  Thus, each individual 
hardware hint are already configurable. Thus we already have a config 
interface. I do acknowledge, that in-case a feature is disabled it still 
takes up space in data_meta areas, but importantly it is NOT stored into 
the area (for performance reasons).


>> +    }
> 
> Thanks for the patches.  I took a quick look at patch 1 and 2 but 
> haven't had a chance to look at the implementation details (eg. 
> KF_UNROLL...etc), yet.
> 

Yes, thanks for the patches, even-though I don't agree with the
approach, at-least until my concerns/use-case can be resolved.
IMHO the best way to convince people is through code. So, thank you for
the effort.  Hopefully we can use some of these ideas and I can also
change/adjust my XDP-hints ideas to incorporate some of this :-)


> Overall (with the example here) looks promising.  There is a lot of 
> flexibility on whether the xdp prog needs any hint at all, which hint it 
> needs, and how to store it.
> 

I do see the advantage that XDP prog only populates metadata it needs.
But how can we use/access this in __xdp_build_skb_from_frame() ?


>> +
>>       return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>>   }
>> diff --git a/tools/testing/selftests/bpf/xskxceiver.c 
>> b/tools/testing/selftests/bpf/xskxceiver.c
>> index 066bd691db13..ce82c89a432e 100644
>> --- a/tools/testing/selftests/bpf/xskxceiver.c
>> +++ b/tools/testing/selftests/bpf/xskxceiver.c
>> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info 
>> *umem, struct pkt_stream *pkt
>>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, 
>> u32 len)
>>   {
>>       void *data = xsk_umem__get_data(buffer, addr);
>> +    void *data_meta = data - sizeof(__u32);
>>       struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct 
>> ethhdr));
>> +    __u32 rx_timestamp = 0;
>>       if (!pkt) {
>>           ksft_print_msg("[%s] too many packets received\n", __func__);
>> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void 
>> *buffer, u64 addr, u32 len)
>>           return false;
>>       }
>> +    memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));

I acknowledge that it is too extensive to add to this patch, but in my 
AF_XDP-interaction example[1], I'm creating a struct xdp_hints_rx_time 
that gets BTF exported[1][2] to the userspace application, and userspace 
decodes the BTF and gets[3] a xsk_btf_member struct for members that 
simply contains a offset+size to read from.

[1] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L47-L51

[2] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80

[3] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_user.c#L123-L129

>> +    if (rx_timestamp == 0) {
>> +        ksft_print_msg("Invalid metadata received: ");
>> +        ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
>> +        return false;
>> +    }
>> +
>>       return true;
>>   }
> 

Looking forward to collaborate :-)
--Jesper
Stanislav Fomichev Oct. 28, 2022, 6:46 p.m. UTC | #3
On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> >> Example on how the metadata is prepared from the BPF context
> >> and consumed by AF_XDP:
> >>
> >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> >>    if not, I'm assuming verifier will remove this "if (0)" branch
> >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> >>    maybe returning a pointer is better?
> >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> >>    makes sure timestamp is not empty
> >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> >>
> >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >> Cc: Jakub Kicinski <kuba@kernel.org>
> >> Cc: Willem de Bruijn <willemb@google.com>
> >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> >> Cc: xdp-hints@xdp-project.net
> >> Cc: netdev@vger.kernel.org
> >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >> ---
> >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> >>   2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> b/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> index b135daddad3a..83c879aa3581 100644
> >> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> @@ -12,9 +12,31 @@ struct {
> >>       __type(value, __u32);
> >>   } xsk SEC(".maps");
> >> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx)
> >> __ksym;
> >> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >> +
> >>   SEC("xdp")
> >>   int rx(struct xdp_md *ctx)
> >>   {
> >> +    void *data, *data_meta;
> >> +    __u32 rx_timestamp;
> >> +    int ret;
> >> +
> >> +    if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
>
> In current veth implementation, bpf_xdp_metadata_have_rx_timestamp()
> will always return true here.
>
> In the case of hardware timestamps, not every packet will contain a
> hardware timestamp.  (See my/Maryam ixgbe patch, where timestamps are
> read via HW device register, which isn't fast, and HW only support this
> for timesync protocols like PTP).
>
> How do you imagine we can extend this?

I'm always returning true for simplicity. In the real world, this
bytecode will look at the descriptors and return true/false depending
on whether the info is there or not.

> >> +        ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
>
> IMHO sizeof() should come from a struct describing data_meta area see:
>
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62

I guess I should've used pointers for the return type instead, something like:

extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

{
   ...
    __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
    if (rx_timestamp) {
        bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
        __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
    }
}

Does that look better?

> >> +        if (ret != 0)
> >> +            return XDP_DROP;
> >> +
> >> +        data = (void *)(long)ctx->data;
> >> +        data_meta = (void *)(long)ctx->data_meta;
> >> +
> >> +        if (data_meta + sizeof(__u32) > data)
> >> +            return XDP_DROP;
> >> +
> >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
>
> So, this approach first stores hints on some other memory location, and
> then need to copy over information into data_meta area. That isn't good
> from a performance perspective.
>
> My idea is to store it in the final data_meta destination immediately.

This approach doesn't have to store the hints in the other memory
location. xdp_buff->priv can point to the real hw descriptor and the
kfunc can have a bytecode that extracts the data from the hw
descriptors. For this particular RFC, we can think that 'skb' is that
hw descriptor for veth driver.

> Do notice that in my approach, the existing ethtool config setting and
> socket options (for timestamps) still apply.  Thus, each individual
> hardware hint are already configurable. Thus we already have a config
> interface. I do acknowledge, that in-case a feature is disabled it still
> takes up space in data_meta areas, but importantly it is NOT stored into
> the area (for performance reasons).

That should be the case with this rfc as well, isn't it? Worst case
scenario, that kfunc bytecode can explicitly check ethtool options and
return false if it's disabled?

> >> +    }
> >
> > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > haven't had a chance to look at the implementation details (eg.
> > KF_UNROLL...etc), yet.
> >
>
> Yes, thanks for the patches, even-though I don't agree with the
> approach, at-least until my concerns/use-case can be resolved.
> IMHO the best way to convince people is through code. So, thank you for
> the effort.  Hopefully we can use some of these ideas and I can also
> change/adjust my XDP-hints ideas to incorporate some of this :-)

Thank you for the feedback as well, appreciate it!
Definitely, looking forward to a v2 from you with some more clarity on
how those btf ids are handled by the bpf/af_xdp side!

> > Overall (with the example here) looks promising.  There is a lot of
> > flexibility on whether the xdp prog needs any hint at all, which hint it
> > needs, and how to store it.
> >
>
> I do see the advantage that XDP prog only populates metadata it needs.
> But how can we use/access this in __xdp_build_skb_from_frame() ?

I don't think __xdp_build_skb_from_frame is automagically solved by
either proposal?
For this proposal, there has to be some expected kernel metadata
format that bpf programs will prepare and the kernel will understand?
Think of it like xdp_hints_common from your proposal; the program will
have to put together xdp_hints_skb into xdp metadata with the parts
that can be populated into skb by the kernel.

For your btf ids proposal, it seems there has to be some extra kernel
code to parse all possible driver btf_if formats and copy the
metadata?





> >> +
> >>       return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
> >>   }
> >> diff --git a/tools/testing/selftests/bpf/xskxceiver.c
> >> b/tools/testing/selftests/bpf/xskxceiver.c
> >> index 066bd691db13..ce82c89a432e 100644
> >> --- a/tools/testing/selftests/bpf/xskxceiver.c
> >> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> >> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info
> >> *umem, struct pkt_stream *pkt
> >>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr,
> >> u32 len)
> >>   {
> >>       void *data = xsk_umem__get_data(buffer, addr);
> >> +    void *data_meta = data - sizeof(__u32);
> >>       struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct
> >> ethhdr));
> >> +    __u32 rx_timestamp = 0;
> >>       if (!pkt) {
> >>           ksft_print_msg("[%s] too many packets received\n", __func__);
> >> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void
> >> *buffer, u64 addr, u32 len)
> >>           return false;
> >>       }
> >> +    memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
>
> I acknowledge that it is too extensive to add to this patch, but in my
> AF_XDP-interaction example[1], I'm creating a struct xdp_hints_rx_time
> that gets BTF exported[1][2] to the userspace application, and userspace
> decodes the BTF and gets[3] a xsk_btf_member struct for members that
> simply contains a offset+size to read from.
>
> [1]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L47-L51
>
> [2]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
>
> [3]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_user.c#L123-L129
>
> >> +    if (rx_timestamp == 0) {
> >> +        ksft_print_msg("Invalid metadata received: ");
> >> +        ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
> >> +        return false;
> >> +    }
> >> +
> >>       return true;
> >>   }
> >
>
> Looking forward to collaborate :-)
> --Jesper
>
Alexander Lobakin Oct. 31, 2022, 2:20 p.m. UTC | #4
From: Stanislav Fomichev <sdf@google.com>
Date: Fri, 28 Oct 2022 11:46:14 -0700

> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> >
> >
> > On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> > >> Example on how the metadata is prepared from the BPF context
> > >> and consumed by AF_XDP:
> > >>
> > >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> > >>    if not, I'm assuming verifier will remove this "if (0)" branch
> > >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> > >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> > >>    maybe returning a pointer is better?
> > >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> > >>    makes sure timestamp is not empty
> > >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> > >>
> > >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > >> Cc: Jakub Kicinski <kuba@kernel.org>
> > >> Cc: Willem de Bruijn <willemb@google.com>
> > >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> > >> Cc: xdp-hints@xdp-project.net
> > >> Cc: netdev@vger.kernel.org
> > >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > >> ---
> > >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> > >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> > >>   2 files changed, 44 insertions(+), 1 deletion(-)

[...]

> > IMHO sizeof() should come from a struct describing data_meta area see:
> >
> > https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> 
> I guess I should've used pointers for the return type instead, something like:
> 
> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> 
> {
>    ...
>     __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>     if (rx_timestamp) {
>         bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
>         __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
>     }
> }
> 
> Does that look better?

I guess it will then be resolved to a direct store, right?
I mean, to smth like

	if (rx_timestamp)
		*(u32 *)data_meta = *rx_timestamp;

, where *rx_timestamp points directly to the Rx descriptor?

> 
> > >> +        if (ret != 0)
> > >> +            return XDP_DROP;
> > >> +
> > >> +        data = (void *)(long)ctx->data;
> > >> +        data_meta = (void *)(long)ctx->data_meta;
> > >> +
> > >> +        if (data_meta + sizeof(__u32) > data)
> > >> +            return XDP_DROP;
> > >> +
> > >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> > >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> >
> > So, this approach first stores hints on some other memory location, and
> > then need to copy over information into data_meta area. That isn't good
> > from a performance perspective.
> >
> > My idea is to store it in the final data_meta destination immediately.
> 
> This approach doesn't have to store the hints in the other memory
> location. xdp_buff->priv can point to the real hw descriptor and the
> kfunc can have a bytecode that extracts the data from the hw
> descriptors. For this particular RFC, we can think that 'skb' is that
> hw descriptor for veth driver.

I really do think intermediate stores can be avoided with this
approach.
Oh, and BTW, if we plan to use a particular Hint in the BPF program
only, there's no need to place it in the metadata area at all, is
there? You only want to get it in your code, so just retrieve it to
the stack and that's it. data_meta is only for cases when you want
hints to appear in AF_XDP.

> 
> > Do notice that in my approach, the existing ethtool config setting and
> > socket options (for timestamps) still apply.  Thus, each individual
> > hardware hint are already configurable. Thus we already have a config
> > interface. I do acknowledge, that in-case a feature is disabled it still
> > takes up space in data_meta areas, but importantly it is NOT stored into
> > the area (for performance reasons).
> 
> That should be the case with this rfc as well, isn't it? Worst case
> scenario, that kfunc bytecode can explicitly check ethtool options and
> return false if it's disabled?

(to Jesper)

Once again, Ethtool idea doesn't work. Let's say you have roughly
50% of frames to be consumed by XDP, other 50% go to skb path and
the stack. In skb, I want as many HW data as possible: checksums,
hash and so on. Let's say in XDP prog I want only timestamp. What's
then? Disable everything but stamp and kill skb path? Enable
everything and kill XDP path?
Stanislav's approach allows you to request only fields you need from
the BPF prog directly, I don't see any reasons for adding one more
layer "oh no I won't give you checksum because it's disabled via
Ethtool".
Maybe I get something wrong, pls explain then :P

> 
> > >> +    }
> > >
> > > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > > haven't had a chance to look at the implementation details (eg.
> > > KF_UNROLL...etc), yet.
> > >
> >
> > Yes, thanks for the patches, even-though I don't agree with the
> > approach, at-least until my concerns/use-case can be resolved.
> > IMHO the best way to convince people is through code. So, thank you for
> > the effort.  Hopefully we can use some of these ideas and I can also
> > change/adjust my XDP-hints ideas to incorporate some of this :-)
> 
> Thank you for the feedback as well, appreciate it!
> Definitely, looking forward to a v2 from you with some more clarity on
> how those btf ids are handled by the bpf/af_xdp side!
> 
> > > Overall (with the example here) looks promising.  There is a lot of
> > > flexibility on whether the xdp prog needs any hint at all, which hint it
> > > needs, and how to store it.
> > >
> >
> > I do see the advantage that XDP prog only populates metadata it needs.
> > But how can we use/access this in __xdp_build_skb_from_frame() ?
> 
> I don't think __xdp_build_skb_from_frame is automagically solved by
> either proposal?

It's solved in my approach[0], so that __xdp_build_skb_from_frame()
are automatically get skb fields populated with the metadata if
available. But I always use a fixed generic structure, which can't
compete with your series in terms of flexibility (but solves stuff
like inter-vendor redirects and so on).
So in general I feel like there should be 2 options for metadata for
users:

1) I use one particular vendor and I always compile AF_XDP programs
   from fresh source code. I need to read/write only fields I want
   to. I'd go with kfunc or kptr here (but I don't think BPF progs
   should parse descriptor formats on their own, so your unroll NDO
   approach looks optimal for me for that case);
2) I use multiple vendors, pre-compiled AF_XDP programs or just old
   source code, I use veth and/or cpumap. So it's sorta
   back-forward-left-right-compatibility path. So here we could just
   use a fixed structure.

> For this proposal, there has to be some expected kernel metadata
> format that bpf programs will prepare and the kernel will understand?
> Think of it like xdp_hints_common from your proposal; the program will
> have to put together xdp_hints_skb into xdp metadata with the parts
> that can be populated into skb by the kernel.
> 
> For your btf ids proposal, it seems there has to be some extra kernel
> code to parse all possible driver btf_if formats and copy the
> metadata?

That's why I define a "generic" struct, so that its consumers
wouldn't have to if-else through a dozen of possible IDs :P

[...]

Great stuff from my PoV, I'd probably like to have some helpers for
writing this new NDO, so that small vendors wouldn't be afraid of
implementing it as Jakub mentioned. But still sorta optimal and
elegant for me, I'm not sure I want to post a "demo version" of my
series anymore :D
I feel like this way + one more "everything-compat-fixed" couple
would satisfy most of potential users.

Thanks,
Olek
Alexander Lobakin Oct. 31, 2022, 2:29 p.m. UTC | #5
From: Alexander Lobakin <alexandr.lobakin@intel.com>
Date: Mon, 31 Oct 2022 15:20:32 +0100

> From: Stanislav Fomichev <sdf@google.com>
> Date: Fri, 28 Oct 2022 11:46:14 -0700
> 
> > On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:

[...]

> > I don't think __xdp_build_skb_from_frame is automagically solved by
> > either proposal?
> 
> It's solved in my approach[0], so that __xdp_build_skb_from_frame()

Yeah sure forget to paste the link, why doncha?

[0] https://github.com/alobakin/linux/commit/a43a9d6895fa11f182becf3a7c202eeceb45a16a

> are automatically get skb fields populated with the metadata if
> available. But I always use a fixed generic structure, which can't
> compete with your series in terms of flexibility (but solves stuff
> like inter-vendor redirects and so on).

[...]

> Thanks,
> Olek

Thanks,
Olek
Stanislav Fomichev Oct. 31, 2022, 5 p.m. UTC | #6
On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: Stanislav Fomichev <sdf@google.com>
> Date: Fri, 28 Oct 2022 11:46:14 -0700
>
> > On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> > >
> > >
> > > On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > > > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> > > >> Example on how the metadata is prepared from the BPF context
> > > >> and consumed by AF_XDP:
> > > >>
> > > >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> > > >>    if not, I'm assuming verifier will remove this "if (0)" branch
> > > >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> > > >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> > > >>    maybe returning a pointer is better?
> > > >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> > > >>    makes sure timestamp is not empty
> > > >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> > > >>
> > > >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > > >> Cc: Jakub Kicinski <kuba@kernel.org>
> > > >> Cc: Willem de Bruijn <willemb@google.com>
> > > >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > > >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > > >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> > > >> Cc: xdp-hints@xdp-project.net
> > > >> Cc: netdev@vger.kernel.org
> > > >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > >> ---
> > > >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> > > >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> > > >>   2 files changed, 44 insertions(+), 1 deletion(-)
>
> [...]
>
> > > IMHO sizeof() should come from a struct describing data_meta area see:
> > >
> > > https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> >
> > I guess I should've used pointers for the return type instead, something like:
> >
> > extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >
> > {
> >    ...
> >     __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >     if (rx_timestamp) {
> >         bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
> >         __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
> >     }
> > }
> >
> > Does that look better?
>
> I guess it will then be resolved to a direct store, right?
> I mean, to smth like
>
>         if (rx_timestamp)
>                 *(u32 *)data_meta = *rx_timestamp;
>
> , where *rx_timestamp points directly to the Rx descriptor?

Right. I should've used that form from the beginning, that memcpy is
confusing :-(

> >
> > > >> +        if (ret != 0)
> > > >> +            return XDP_DROP;
> > > >> +
> > > >> +        data = (void *)(long)ctx->data;
> > > >> +        data_meta = (void *)(long)ctx->data_meta;
> > > >> +
> > > >> +        if (data_meta + sizeof(__u32) > data)
> > > >> +            return XDP_DROP;
> > > >> +
> > > >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> > > >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> > >
> > > So, this approach first stores hints on some other memory location, and
> > > then need to copy over information into data_meta area. That isn't good
> > > from a performance perspective.
> > >
> > > My idea is to store it in the final data_meta destination immediately.
> >
> > This approach doesn't have to store the hints in the other memory
> > location. xdp_buff->priv can point to the real hw descriptor and the
> > kfunc can have a bytecode that extracts the data from the hw
> > descriptors. For this particular RFC, we can think that 'skb' is that
> > hw descriptor for veth driver.
>
> I really do think intermediate stores can be avoided with this
> approach.
> Oh, and BTW, if we plan to use a particular Hint in the BPF program
> only, there's no need to place it in the metadata area at all, is
> there? You only want to get it in your code, so just retrieve it to
> the stack and that's it. data_meta is only for cases when you want
> hints to appear in AF_XDP.

Correct.

> > > Do notice that in my approach, the existing ethtool config setting and
> > > socket options (for timestamps) still apply.  Thus, each individual
> > > hardware hint are already configurable. Thus we already have a config
> > > interface. I do acknowledge, that in-case a feature is disabled it still
> > > takes up space in data_meta areas, but importantly it is NOT stored into
> > > the area (for performance reasons).
> >
> > That should be the case with this rfc as well, isn't it? Worst case
> > scenario, that kfunc bytecode can explicitly check ethtool options and
> > return false if it's disabled?
>
> (to Jesper)
>
> Once again, Ethtool idea doesn't work. Let's say you have roughly
> 50% of frames to be consumed by XDP, other 50% go to skb path and
> the stack. In skb, I want as many HW data as possible: checksums,
> hash and so on. Let's say in XDP prog I want only timestamp. What's
> then? Disable everything but stamp and kill skb path? Enable
> everything and kill XDP path?
> Stanislav's approach allows you to request only fields you need from
> the BPF prog directly, I don't see any reasons for adding one more
> layer "oh no I won't give you checksum because it's disabled via
> Ethtool".
> Maybe I get something wrong, pls explain then :P

Agree, good point.

> > > >> +    }
> > > >
> > > > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > > > haven't had a chance to look at the implementation details (eg.
> > > > KF_UNROLL...etc), yet.
> > > >
> > >
> > > Yes, thanks for the patches, even-though I don't agree with the
> > > approach, at-least until my concerns/use-case can be resolved.
> > > IMHO the best way to convince people is through code. So, thank you for
> > > the effort.  Hopefully we can use some of these ideas and I can also
> > > change/adjust my XDP-hints ideas to incorporate some of this :-)
> >
> > Thank you for the feedback as well, appreciate it!
> > Definitely, looking forward to a v2 from you with some more clarity on
> > how those btf ids are handled by the bpf/af_xdp side!
> >
> > > > Overall (with the example here) looks promising.  There is a lot of
> > > > flexibility on whether the xdp prog needs any hint at all, which hint it
> > > > needs, and how to store it.
> > > >
> > >
> > > I do see the advantage that XDP prog only populates metadata it needs.
> > > But how can we use/access this in __xdp_build_skb_from_frame() ?
> >
> > I don't think __xdp_build_skb_from_frame is automagically solved by
> > either proposal?
>
> It's solved in my approach[0], so that __xdp_build_skb_from_frame()
> are automatically get skb fields populated with the metadata if
> available. But I always use a fixed generic structure, which can't
> compete with your series in terms of flexibility (but solves stuff
> like inter-vendor redirects and so on).
> So in general I feel like there should be 2 options for metadata for
> users:
>
> 1) I use one particular vendor and I always compile AF_XDP programs
>    from fresh source code. I need to read/write only fields I want
>    to. I'd go with kfunc or kptr here (but I don't think BPF progs
>    should parse descriptor formats on their own, so your unroll NDO
>    approach looks optimal for me for that case);
> 2) I use multiple vendors, pre-compiled AF_XDP programs or just old
>    source code, I use veth and/or cpumap. So it's sorta
>    back-forward-left-right-compatibility path. So here we could just
>    use a fixed structure.

For (2) I really like Toke's suggestion about some extra helper that
prepares the metadata that the kernel path will later on be able to
understand.
The only downside I see is that it has to be called explicitly, but if
we assume that libxdp can also abstract this detail, doesn't sound
like a huge issue to me?

> > For this proposal, there has to be some expected kernel metadata
> > format that bpf programs will prepare and the kernel will understand?
> > Think of it like xdp_hints_common from your proposal; the program will
> > have to put together xdp_hints_skb into xdp metadata with the parts
> > that can be populated into skb by the kernel.
> >
> > For your btf ids proposal, it seems there has to be some extra kernel
> > code to parse all possible driver btf_if formats and copy the
> > metadata?
>
> That's why I define a "generic" struct, so that its consumers
> wouldn't have to if-else through a dozen of possible IDs :P
>
> [...]
>
> Great stuff from my PoV, I'd probably like to have some helpers for
> writing this new NDO, so that small vendors wouldn't be afraid of
> implementing it as Jakub mentioned. But still sorta optimal and
> elegant for me, I'm not sure I want to post a "demo version" of my
> series anymore :D
> I feel like this way + one more "everything-compat-fixed" couple
> would satisfy most of potential users.

Awesome, thanks for the review and the feedback!
Jesper Dangaard Brouer Nov. 1, 2022, 1:18 p.m. UTC | #7
On 31/10/2022 18.00, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
>>
>> From: Stanislav Fomichev <sdf@google.com>
>> Date: Fri, 28 Oct 2022 11:46:14 -0700
>>
>>> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
>>> <jbrouer@redhat.com> wrote:
>>>>
>>>>
>>>> On 28/10/2022 08.22, Martin KaFai Lau wrote:
>>>>> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
>>>>>> Example on how the metadata is prepared from the BPF context
>>>>>> and consumed by AF_XDP:
>>>>>>
>>>>>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>>>>>>     if not, I'm assuming verifier will remove this "if (0)" branch
>>>>>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>>>>>>     the program has to bpf_xdp_adjust_meta+memcpy it;
>>>>>>     maybe returning a pointer is better?
>>>>>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>>>>>>     makes sure timestamp is not empty
>>>>>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
>>>>>>
>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>> Cc: netdev@vger.kernel.org
>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>> ---
>>>>>>    .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>>>>>>    tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>>>>>>    2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> [...]
>>
>>>> IMHO sizeof() should come from a struct describing data_meta area see:
>>>>
>>>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
>>>
>>> I guess I should've used pointers for the return type instead, something like:
>>>
>>> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
>>>
>>> {
>>>     ...
>>>      __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>>      if (rx_timestamp) {
>>>          bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
>>>          __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
>>>      }
>>> }
>>>
>>> Does that look better?
>>
>> I guess it will then be resolved to a direct store, right?
>> I mean, to smth like
>>
>>          if (rx_timestamp)
>>                  *(u32 *)data_meta = *rx_timestamp;
>>
>> , where *rx_timestamp points directly to the Rx descriptor?
> 
> Right. I should've used that form from the beginning, that memcpy is
> confusing :-(
> 
>>>
>>>>>> +        if (ret != 0)
>>>>>> +            return XDP_DROP;
>>>>>> +
>>>>>> +        data = (void *)(long)ctx->data;
>>>>>> +        data_meta = (void *)(long)ctx->data_meta;
>>>>>> +
>>>>>> +        if (data_meta + sizeof(__u32) > data)
>>>>>> +            return XDP_DROP;
>>>>>> +
>>>>>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>>>>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
>>>>
>>>> So, this approach first stores hints on some other memory location, and
>>>> then need to copy over information into data_meta area. That isn't good
>>>> from a performance perspective.
>>>>
>>>> My idea is to store it in the final data_meta destination immediately.
>>>
>>> This approach doesn't have to store the hints in the other memory
>>> location. xdp_buff->priv can point to the real hw descriptor and the
>>> kfunc can have a bytecode that extracts the data from the hw
>>> descriptors. For this particular RFC, we can think that 'skb' is that
>>> hw descriptor for veth driver.

Once you point xdp_buff->priv to the real hw descriptor, then we also
need to have some additional data/pointers to NIC hardware info + HW
setup state. You will hit some of the same challenges as John, like
hardware/firmware revisions and chip models, that Jakub pointed out.
Because your approach stays with the driver code, I guess it will be a
bit easier code wise. Maybe we can store data/pointer needed for this in
xdp_rxq_info (xdp->rxq).

I would need to see some code that juggling this HW NCI state from the
kfunc expansion to be convinced this is the right approach.


>>
>> I really do think intermediate stores can be avoided with this
>> approach.
>> Oh, and BTW, if we plan to use a particular Hint in the BPF program
>> only, there's no need to place it in the metadata area at all, is
>> there? You only want to get it in your code, so just retrieve it to
>> the stack and that's it. data_meta is only for cases when you want
>> hints to appear in AF_XDP.
> 
> Correct.

It is not *only* AF_XDP that need data stored in data_meta.

The stores data_meta are also needed for veth and cpumap, because the HW
descriptor is "out-of-scope" and thus no-longer available.


> 
>>>> Do notice that in my approach, the existing ethtool config setting and
>>>> socket options (for timestamps) still apply.  Thus, each individual
>>>> hardware hint are already configurable. Thus we already have a config
>>>> interface. I do acknowledge, that in-case a feature is disabled it still
>>>> takes up space in data_meta areas, but importantly it is NOT stored into
>>>> the area (for performance reasons).
>>>
>>> That should be the case with this rfc as well, isn't it? Worst case
>>> scenario, that kfunc bytecode can explicitly check ethtool options and
>>> return false if it's disabled?
>>
>> (to Jesper)
>>
>> Once again, Ethtool idea doesn't work. Let's say you have roughly
>> 50% of frames to be consumed by XDP, other 50% go to skb path and
>> the stack. In skb, I want as many HW data as possible: checksums,
>> hash and so on. Let's say in XDP prog I want only timestamp. What's
>> then? Disable everything but stamp and kill skb path? Enable
>> everything and kill XDP path?
>> Stanislav's approach allows you to request only fields you need from
>> the BPF prog directly, I don't see any reasons for adding one more
>> layer "oh no I won't give you checksum because it's disabled via
>> Ethtool".
>> Maybe I get something wrong, pls explain then :P
> 
> Agree, good point.

Stanislav's (and John's proposal) is definitely focused and addressing
something else than my patchset.

I optimized the XDP-hints population (for i40e) down to 6 nanosec (on
3.6 GHz CPU = 21 cycles).  Plus, I added an ethtool switch to turn it
off for those XDP users that cannot live with this overhead.  Hoping
this would be fast-enough such that we didn't have to add this layer.
(If XDP returns XDP_PASS then this decoded info can be used for the SKB
creation. Thus, this is essentially just moving decoding RX-desc a bit
earlier in the the driver).

One of my use-cases is getting RX-checksum support in xdp_frame's and
transferring this to SKB creation time.  I have done a number of
measurements[1] to find out how much we can gain of performance for UDP
packets (1500 bytes) with/without RX-checksum.  Initial result showed I
saved 91 nanosec, but that was avoiding to touching data.  Doing full
userspace UDP delivery with a copy (or copy+checksum) showed the real
save was 54 nanosec.  In this context, the 6 nanosec was very small.
Thus, I didn't choose to pursue a BPF layer for individual fields.

  [1]
https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org

Sure it is super cool if we can create this BPF layer that programmable
selects individual fields from the descriptor, and maybe we ALSO need that.
Could this layer could still be added after my patchset(?), as one could
disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
only fields need by the specific XDP-prog use-case.
Could they also co-exist(?), kfuncs/kptr could extend the
xdp_hints_rx_common struct (in data_meta area) with more advanced
offload-hints and then update the BTF-ID (yes, BPF can already resolve
its own BTF-IDs from BPF-prog code).

Great to see all the discussions and different oppinons :-)
--Jesper
Stanislav Fomichev Nov. 1, 2022, 8:12 p.m. UTC | #8
On Tue, Nov 1, 2022 at 6:18 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 31/10/2022 18.00, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
> > <alexandr.lobakin@intel.com> wrote:
> >>
> >> From: Stanislav Fomichev <sdf@google.com>
> >> Date: Fri, 28 Oct 2022 11:46:14 -0700
> >>
> >>> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> >>> <jbrouer@redhat.com> wrote:
> >>>>
> >>>>
> >>>> On 28/10/2022 08.22, Martin KaFai Lau wrote:
> >>>>> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> >>>>>> Example on how the metadata is prepared from the BPF context
> >>>>>> and consumed by AF_XDP:
> >>>>>>
> >>>>>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> >>>>>>     if not, I'm assuming verifier will remove this "if (0)" branch
> >>>>>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> >>>>>>     the program has to bpf_xdp_adjust_meta+memcpy it;
> >>>>>>     maybe returning a pointer is better?
> >>>>>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> >>>>>>     makes sure timestamp is not empty
> >>>>>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> >>>>>>
> >>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
> >>>>>> Cc: Willem de Bruijn <willemb@google.com>
> >>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
> >>>>>> Cc: xdp-hints@xdp-project.net
> >>>>>> Cc: netdev@vger.kernel.org
> >>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >>>>>> ---
> >>>>>>    .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> >>>>>>    tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> >>>>>>    2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> [...]
> >>
> >>>> IMHO sizeof() should come from a struct describing data_meta area see:
> >>>>
> >>>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> >>>
> >>> I guess I should've used pointers for the return type instead, something like:
> >>>
> >>> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >>>
> >>> {
> >>>     ...
> >>>      __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>      if (rx_timestamp) {
> >>>          bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
> >>>          __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
> >>>      }
> >>> }
> >>>
> >>> Does that look better?
> >>
> >> I guess it will then be resolved to a direct store, right?
> >> I mean, to smth like
> >>
> >>          if (rx_timestamp)
> >>                  *(u32 *)data_meta = *rx_timestamp;
> >>
> >> , where *rx_timestamp points directly to the Rx descriptor?
> >
> > Right. I should've used that form from the beginning, that memcpy is
> > confusing :-(
> >
> >>>
> >>>>>> +        if (ret != 0)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        data = (void *)(long)ctx->data;
> >>>>>> +        data_meta = (void *)(long)ctx->data_meta;
> >>>>>> +
> >>>>>> +        if (data_meta + sizeof(__u32) > data)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>>>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> >>>>
> >>>> So, this approach first stores hints on some other memory location, and
> >>>> then need to copy over information into data_meta area. That isn't good
> >>>> from a performance perspective.
> >>>>
> >>>> My idea is to store it in the final data_meta destination immediately.
> >>>
> >>> This approach doesn't have to store the hints in the other memory
> >>> location. xdp_buff->priv can point to the real hw descriptor and the
> >>> kfunc can have a bytecode that extracts the data from the hw
> >>> descriptors. For this particular RFC, we can think that 'skb' is that
> >>> hw descriptor for veth driver.
>
> Once you point xdp_buff->priv to the real hw descriptor, then we also
> need to have some additional data/pointers to NIC hardware info + HW
> setup state. You will hit some of the same challenges as John, like
> hardware/firmware revisions and chip models, that Jakub pointed out.
> Because your approach stays with the driver code, I guess it will be a
> bit easier code wise. Maybe we can store data/pointer needed for this in
> xdp_rxq_info (xdp->rxq).
>
> I would need to see some code that juggling this HW NCI state from the
> kfunc expansion to be convinced this is the right approach.

I've replied to Martin in another thread, but that's a good point. We
might need to do locks while parsing the descriptors and converting to
kernel time, so maybe assuming that everything can be unrolled won't
stand 100%. OTOH, it seems like unrolled bytecode can (with some
quirks) always call into some driver specific c function...

I'm trying to convert a couple of drivers (without really testing the
implementation) and see whether there are any other big issues.


> >>
> >> I really do think intermediate stores can be avoided with this
> >> approach.
> >> Oh, and BTW, if we plan to use a particular Hint in the BPF program
> >> only, there's no need to place it in the metadata area at all, is
> >> there? You only want to get it in your code, so just retrieve it to
> >> the stack and that's it. data_meta is only for cases when you want
> >> hints to appear in AF_XDP.
> >
> > Correct.
>
> It is not *only* AF_XDP that need data stored in data_meta.
>
> The stores data_meta are also needed for veth and cpumap, because the HW
> descriptor is "out-of-scope" and thus no-longer available.
>
>
> >
> >>>> Do notice that in my approach, the existing ethtool config setting and
> >>>> socket options (for timestamps) still apply.  Thus, each individual
> >>>> hardware hint are already configurable. Thus we already have a config
> >>>> interface. I do acknowledge, that in-case a feature is disabled it still
> >>>> takes up space in data_meta areas, but importantly it is NOT stored into
> >>>> the area (for performance reasons).
> >>>
> >>> That should be the case with this rfc as well, isn't it? Worst case
> >>> scenario, that kfunc bytecode can explicitly check ethtool options and
> >>> return false if it's disabled?
> >>
> >> (to Jesper)
> >>
> >> Once again, Ethtool idea doesn't work. Let's say you have roughly
> >> 50% of frames to be consumed by XDP, other 50% go to skb path and
> >> the stack. In skb, I want as many HW data as possible: checksums,
> >> hash and so on. Let's say in XDP prog I want only timestamp. What's
> >> then? Disable everything but stamp and kill skb path? Enable
> >> everything and kill XDP path?
> >> Stanislav's approach allows you to request only fields you need from
> >> the BPF prog directly, I don't see any reasons for adding one more
> >> layer "oh no I won't give you checksum because it's disabled via
> >> Ethtool".
> >> Maybe I get something wrong, pls explain then :P
> >
> > Agree, good point.
>
> Stanislav's (and John's proposal) is definitely focused and addressing
> something else than my patchset.
>
> I optimized the XDP-hints population (for i40e) down to 6 nanosec (on
> 3.6 GHz CPU = 21 cycles).  Plus, I added an ethtool switch to turn it
> off for those XDP users that cannot live with this overhead.  Hoping
> this would be fast-enough such that we didn't have to add this layer.
> (If XDP returns XDP_PASS then this decoded info can be used for the SKB
> creation. Thus, this is essentially just moving decoding RX-desc a bit
> earlier in the the driver).
>
> One of my use-cases is getting RX-checksum support in xdp_frame's and
> transferring this to SKB creation time.  I have done a number of
> measurements[1] to find out how much we can gain of performance for UDP
> packets (1500 bytes) with/without RX-checksum.  Initial result showed I
> saved 91 nanosec, but that was avoiding to touching data.  Doing full
> userspace UDP delivery with a copy (or copy+checksum) showed the real
> save was 54 nanosec.  In this context, the 6 nanosec was very small.
> Thus, I didn't choose to pursue a BPF layer for individual fields.
>
>   [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
>
> Sure it is super cool if we can create this BPF layer that programmable
> selects individual fields from the descriptor, and maybe we ALSO need that.
> Could this layer could still be added after my patchset(?), as one could
> disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
> only fields need by the specific XDP-prog use-case.
> Could they also co-exist(?), kfuncs/kptr could extend the
> xdp_hints_rx_common struct (in data_meta area) with more advanced
> offload-hints and then update the BTF-ID (yes, BPF can already resolve
> its own BTF-IDs from BPF-prog code).
>
> Great to see all the discussions and different oppinons :-)
> --Jesper
>
Toke Høiland-Jørgensen Nov. 1, 2022, 10:23 p.m. UTC | #9
>>>>> So, this approach first stores hints on some other memory location, and
>>>>> then need to copy over information into data_meta area. That isn't good
>>>>> from a performance perspective.
>>>>>
>>>>> My idea is to store it in the final data_meta destination immediately.
>>>>
>>>> This approach doesn't have to store the hints in the other memory
>>>> location. xdp_buff->priv can point to the real hw descriptor and the
>>>> kfunc can have a bytecode that extracts the data from the hw
>>>> descriptors. For this particular RFC, we can think that 'skb' is that
>>>> hw descriptor for veth driver.
>
> Once you point xdp_buff->priv to the real hw descriptor, then we also
> need to have some additional data/pointers to NIC hardware info + HW
> setup state. You will hit some of the same challenges as John, like
> hardware/firmware revisions and chip models, that Jakub pointed out.
> Because your approach stays with the driver code, I guess it will be a
> bit easier code wise. Maybe we can store data/pointer needed for this in
> xdp_rxq_info (xdp->rxq).
>
> I would need to see some code that juggling this HW NCI state from the
> kfunc expansion to be convinced this is the right approach.

+1 on needing to see this working for the actual metadata we want to
support, but I think the kfunc approach otherwise shows promise; see
below.

[...]

> Sure it is super cool if we can create this BPF layer that programmable
> selects individual fields from the descriptor, and maybe we ALSO need that.
> Could this layer could still be added after my patchset(?), as one could
> disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
> only fields need by the specific XDP-prog use-case.
> Could they also co-exist(?), kfuncs/kptr could extend the
> xdp_hints_rx_common struct (in data_meta area) with more advanced
> offload-hints and then update the BTF-ID (yes, BPF can already resolve
> its own BTF-IDs from BPF-prog code).

I actually think the two approaches are more similar than they appear
from a user-facing API perspective. Or at least they should be.

What I mean is, that with the BTF-ID approach, we still expect people to
write code like (from Stanislav's example in the other xdp_hints thread[0]):

If (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
populated at runtime by libbpf? */) {
  // do something with rx_timestamp
  // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
} else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
  // do something else
  // plus explicitly handle xdp_hints_common here?
} else {
  // handle xdp_hints_common
}

whereas with kfuncs (from this thread) this becomes:

if (xdp_metadata_rx_timestamp_exists(ctx))
  timestamp = xdp_metadata_rx_timestamp(ctx);


We can hide the former behind CO-RE macros to make it look like the
latter. But because we're just exposing the BTF IDs, people can in fact
just write code like the example above (directly checking the BTF IDs),
and that will work fine, but has a risk of leading to a proliferation of
device-specific XDP programs. Whereas with kfuncs we keep all this stuff
internal to the kernel (inside the kfuncs), making it much easier to
change it later.

Quoting yourself from the other thread[1]:

> In this patchset I'm trying to balance the different users. And via BTF
> I'm trying hard not to create more UAPI (e.g. more fixed fields avail in
> xdp_md that we cannot get rid of). And trying to add driver flexibility
> on-top of the common struct.  This flexibility seems to be stalling the
> patchset as we haven't found the perfect way to express this (yet) given
> BTF layout is per driver.

With kfuncs we kinda sidestep this issue because the kernel can handle
the per-driver specialisation by the unrolling trick. The drawback being
that programs will be tied to a particular device if they are using
metadata, but I think that's an acceptable trade-off.

-Toke

[0] https://lore.kernel.org/r/CAKH8qBuYVk7QwVOSYrhMNnaKFKGd7M9bopDyNp6-SnN6hSeTDQ@mail.gmail.com
[1] https://lore.kernel.org/r/ad360933-953a-7a99-5057-4d452a9a6005@redhat.com
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c b/tools/testing/selftests/bpf/progs/xskxceiver.c
index b135daddad3a..83c879aa3581 100644
--- a/tools/testing/selftests/bpf/progs/xskxceiver.c
+++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
@@ -12,9 +12,31 @@  struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) __ksym;
+extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
+
 SEC("xdp")
 int rx(struct xdp_md *ctx)
 {
+	void *data, *data_meta;
+	__u32 rx_timestamp;
+	int ret;
+
+	if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
+		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
+		if (ret != 0)
+			return XDP_DROP;
+
+		data = (void *)(long)ctx->data;
+		data_meta = (void *)(long)ctx->data_meta;
+
+		if (data_meta + sizeof(__u32) > data)
+			return XDP_DROP;
+
+		rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+		__builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
+	}
+
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 066bd691db13..ce82c89a432e 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -871,7 +871,9 @@  static bool is_offset_correct(struct xsk_umem_info *umem, struct pkt_stream *pkt
 static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
 {
 	void *data = xsk_umem__get_data(buffer, addr);
+	void *data_meta = data - sizeof(__u32);
 	struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct ethhdr));
+	__u32 rx_timestamp = 0;
 
 	if (!pkt) {
 		ksft_print_msg("[%s] too many packets received\n", __func__);
@@ -907,6 +909,13 @@  static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
 		return false;
 	}
 
+	memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
+	if (rx_timestamp == 0) {
+		ksft_print_msg("Invalid metadata received: ");
+		ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
+		return false;
+	}
+
 	return true;
 }
 
@@ -1331,6 +1340,7 @@  static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	u64 umem_sz = ifobject->umem->num_frames * ifobject->umem->frame_size;
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	LIBBPF_OPTS(bpf_xdp_query_opts, opts);
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
 	int ret, ifindex;
 	void *bufs;
 
@@ -1340,10 +1350,21 @@  static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	if (!ifindex)
 		exit_with_error(errno);
 
-	ifobject->bpf_obj = xskxceiver__open_and_load();
+	open_opts.prog_ifindex = ifindex;
+
+	ifobject->bpf_obj = xskxceiver__open_opts(&open_opts);
 	if (libbpf_get_error(ifobject->bpf_obj))
 		exit_with_error(libbpf_get_error(ifobject->bpf_obj));
 
+	ret = bpf_program__set_flags(bpf_object__find_program_by_name(ifobject->bpf_obj->obj, "rx"),
+				     BPF_F_XDP_HAS_METADATA);
+	if (ret < 0)
+		exit_with_error(ret);
+
+	ret = xskxceiver__load(ifobject->bpf_obj);
+	if (ret < 0)
+		exit_with_error(ret);
+
 	if (ifobject->umem->unaligned_mode)
 		mmap_flags |= MAP_HUGETLB;