
[bpf-next,05/11] veth: Support rx timestamp metadata for xdp

Message ID 20221115030210.3159213-6-sdf@google.com (mailing list archive)
State Superseded
Delegated to: BPF
Series: xdp: hints via kfuncs

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success
netdev/cover_letter success Series has a cover letter
netdev/patch_count success
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 4 maintainers not CCed: hawk@kernel.org davem@davemloft.net pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 87 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-9 success Logs for set-matrix

Commit Message

Stanislav Fomichev Nov. 15, 2022, 3:02 a.m. UTC
The goal is to enable end-to-end testing of the metadata
for AF_XDP. The current rx_timestamp kfunc returns the current
time, which should be enough to exercise this new functionality.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Toke Høiland-Jørgensen Nov. 15, 2022, 4:17 p.m. UTC | #1
Stanislav Fomichev <sdf@google.com> writes:

> The goal is to enable end-to-end testing of the metadata
> for AF_XDP. The current rx_timestamp kfunc returns the current
> time, which should be enough to exercise this new functionality.
>
> [...]
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 2a4592780141..c626580a2294 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -25,6 +25,7 @@
>  #include <linux/filter.h>
>  #include <linux/ptr_ring.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_patch.h>
>  #include <linux/net_tstamp.h>
>  
>  #define DRV_NAME	"veth"
> @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>  
> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +			      struct bpf_patch *patch)
> +{
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +		/* return true; */
> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> +		/* return ktime_get_mono_fast_ns(); */
> +		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> +	}
> +}

So these look reasonable enough, but would be good to see some examples
of kfunc implementations that don't just BPF_CALL to a kernel function
(with those helper wrappers we were discussing before).

-Toke
Stanislav Fomichev Nov. 15, 2022, 6:37 p.m. UTC | #2
On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > [...]
> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > +                           struct bpf_patch *patch)
> > +{
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > +             /* return true; */
> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > +             /* return ktime_get_mono_fast_ns(); */
> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > +     }
> > +}
>
> So these look reasonable enough, but would be good to see some examples
> of kfunc implementations that don't just BPF_CALL to a kernel function
> (with those helper wrappers we were discussing before).

Let's maybe add them if/when needed, as we add more metadata support?
xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
examples, so it shouldn't be a problem to resurrect them at some
point?
Toke Høiland-Jørgensen Nov. 15, 2022, 10:46 p.m. UTC | #3
Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > [...]
>> >
>> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>> > +                           struct bpf_patch *patch)
>> > +{
>> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>> > +             /* return true; */
>> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>> > +             /* return ktime_get_mono_fast_ns(); */
>> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>> > +     }
>> > +}
>>
>> So these look reasonable enough, but would be good to see some examples
>> of kfunc implementations that don't just BPF_CALL to a kernel function
>> (with those helper wrappers we were discussing before).
>
> Let's maybe add them if/when needed as we add more metadata support?
> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> examples, so it shouldn't be a problem to resurrect them back at some
> point?

Well, the reason I asked for them is that I think having to maintain the
BPF code generation in the drivers is probably the biggest drawback of
the kfunc approach, so it would be good to be relatively sure that we
can manage that complexity (via helpers) before we commit to this :)

-Toke
Stanislav Fomichev Nov. 16, 2022, 4:09 a.m. UTC | #4
On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > [...]
> >> >
> >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> >> > +                           struct bpf_patch *patch)
> >> > +{
> >> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> >> > +             /* return true; */
> >> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> >> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> >> > +             /* return ktime_get_mono_fast_ns(); */
> >> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> >> > +     }
> >> > +}
> >>
> >> So these look reasonable enough, but would be good to see some examples
> >> of kfunc implementations that don't just BPF_CALL to a kernel function
> >> (with those helper wrappers we were discussing before).
> >
> > Let's maybe add them if/when needed as we add more metadata support?
> > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > examples, so it shouldn't be a problem to resurrect them back at some
> > point?
>
> Well, the reason I asked for them is that I think having to maintain the
> BPF code generation in the drivers is probably the biggest drawback of
> the kfunc approach, so it would be good to be relatively sure that we
> can manage that complexity (via helpers) before we commit to this :)

Right, and I've added a bunch of examples in the v2 RFC so we can judge
whether that complexity is manageable or not :-)
Do you want me to add back those wrappers you mentioned, without any real users?
Because I had to remove my veth tstamp accessors due to John's/Jesper's
objections; I can maybe bring some of this back, gated by a
static_branch to avoid the fastpath cost?
John Fastabend Nov. 16, 2022, 6:38 a.m. UTC | #5
Stanislav Fomichev wrote:
> On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Stanislav Fomichev <sdf@google.com> writes:
> > >>
> > >> > [...]
> > >> >
> > >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > >> > +                           struct bpf_patch *patch)
> > >> > +{
> > >> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > >> > +             /* return true; */
> > >> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > >> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > >> > +             /* return ktime_get_mono_fast_ns(); */
> > >> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > >> > +     }
> > >> > +}
> > >>
> > >> So these look reasonable enough, but would be good to see some examples
> > >> of kfunc implementations that don't just BPF_CALL to a kernel function
> > >> (with those helper wrappers we were discussing before).
> > >
> > > Let's maybe add them if/when needed as we add more metadata support?
> > > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > examples, so it shouldn't be a problem to resurrect them back at some
> > > point?
> >
> > Well, the reason I asked for them is that I think having to maintain the
> > BPF code generation in the drivers is probably the biggest drawback of
> > the kfunc approach, so it would be good to be relatively sure that we
> > can manage that complexity (via helpers) before we commit to this :)
> 
> Right, and I've added a bunch of examples in v2 rfc so we can judge
> whether that complexity is manageable or not :-)
> Do you want me to add back those wrappers you mentioned, without any real users?
> Because I had to remove my veth tstamp accessors due to John/Jesper
> objections; I can maybe bring some of this back gated by some
> static_branch to avoid the fastpath cost?

I missed the context a bit; what did you mean by "would be good to see some
examples of kfunc implementations that don't just BPF_CALL to a kernel
function"? In this case, do you mean BPF code directly, without the call?

Early on I thought we should just expose the rx_descriptor, which would
be roughly the same, right? (The difference being code embedded in the driver
vs. a lib.) The trouble I ran into is driver code using seqlock_t and mutexes,
which wasn't as straightforward as simply reading a field from
the descriptor. For example, in mlx, getting the ts would be easy from
BPF with the mlx4_cqe struct exposed:

u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
{
        u64 hi, lo;
        struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;

        lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
        hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;

        return hi | lo;
}

but converting that to nsec is a bit annoying,

void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
                            struct skb_shared_hwtstamps *hwts,
                            u64 timestamp)
{
        unsigned int seq;
        u64 nsec;

        do {
                seq = read_seqbegin(&mdev->clock_lock);
                nsec = timecounter_cyc2time(&mdev->clock, timestamp);
        } while (read_seqretry(&mdev->clock_lock, seq));

        memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
        hwts->hwtstamp = ns_to_ktime(nsec);
}

I think the nsec is what you really want.

With all the drivers doing slightly different ops, we would have
to create read_seqbegin, read_seqretry, mutex_lock, ... To cover
at least the mlx and ice drivers, it looks like we would need some
more BPF primitives/helpers. It also looks like some more work is
needed on the ice driver to get rx tstamps on all packets.

Anyway, this convinced me that real devices will probably use BPF_CALL
and not BPF insns directly.
Martin KaFai Lau Nov. 16, 2022, 7:47 a.m. UTC | #6
On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>> +                           struct bpf_patch *patch)
>>>>>> +{
>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>> +             /* return true; */
>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>> +     }
>>>>>> +}
>>>>>
>>>>> So these look reasonable enough, but would be good to see some examples
>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
>>>>> (with those helper wrappers we were discussing before).
>>>>
>>>> Let's maybe add them if/when needed as we add more metadata support?
>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
>>>> examples, so it shouldn't be a problem to resurrect them back at some
>>>> point?
>>>
>>> Well, the reason I asked for them is that I think having to maintain the
>>> BPF code generation in the drivers is probably the biggest drawback of
>>> the kfunc approach, so it would be good to be relatively sure that we
>>> can manage that complexity (via helpers) before we commit to this :)
>>
>> Right, and I've added a bunch of examples in v2 rfc so we can judge
>> whether that complexity is manageable or not :-)
>> Do you want me to add those wrappers you've back without any real users?
>> Because I had to remove my veth tstamp accessors due to John/Jesper
>> objections; I can maybe bring some of this back gated by some
>> static_branch to avoid the fastpath cost?
> 
> [...]
> 
> With all the drivers doing slightly different ops we would have
> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> at least the mlx and ice drivers it looks like we would need some
> more BPF primitives/helpers. Looks like some more work is needed
> on ice driver though to get rx tstamps on all packets.
> 
> Anyways this convinced me real devices will probably use BPF_CALL
> and not BPF insns directly.

Some of the mlx5 path looks like this:

#define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))

static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
                                               u64 timestamp)
{
         u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);

         return ns_to_ktime(time);
}

If some hints are harder to get, then just doing a kfunc call is better.

csum may have a better chance to be inlined?

Regardless, BPF inlining is a well-solved problem and is already used in many
bpf helpers, so there are many examples in the kernel. I don't think it is
necessary to block this series because some helper wrappers for inlining
are missing. The drivers can always start with the simpler kfunc call first
and optimize later where a given hint allows it.
Toke Høiland-Jørgensen Nov. 16, 2022, 10:08 a.m. UTC | #7
Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>>> +                           struct bpf_patch *patch)
>>>>>>> +{
>>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>>> +             /* return true; */
>>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>>> +     }
>>>>>>> +}
>>>>>>
>>>>>> So these look reasonable enough, but would be good to see some examples
>>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
>>>>>> (with those helper wrappers we were discussing before).
>>>>>
>>>>> Let's maybe add them if/when needed as we add more metadata support?
>>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
>>>>> examples, so it shouldn't be a problem to resurrect them back at some
>>>>> point?
>>>>
>>>> Well, the reason I asked for them is that I think having to maintain the
>>>> BPF code generation in the drivers is probably the biggest drawback of
>>>> the kfunc approach, so it would be good to be relatively sure that we
>>>> can manage that complexity (via helpers) before we commit to this :)
>>>
>>> Right, and I've added a bunch of examples in v2 rfc so we can judge
>>> whether that complexity is manageable or not :-)
>>> Do you want me to add those wrappers you've back without any real users?
>>> Because I had to remove my veth tstamp accessors due to John/Jesper
>>> objections; I can maybe bring some of this back gated by some
>>> static_branch to avoid the fastpath cost?
>> 
>> [...]
>
> Some of the mlx5 path looks like this:
>
> #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
>
> static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
>                                                u64 timestamp)
> {
>          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
>
>          return ns_to_ktime(time);
> }
>
> If some hints are harder to get, then just doing a kfunc call is better.

Sure, but if we end up with a full function call for every field in
the metadata, that will have a significant performance impact on the
XDP data path (I'm thinking mostly about the skb metadata case here,
which will collect several bits of metadata).

> csum may have a better chance to inline?

Yup, I agree. Including that also makes it possible to benchmark this
series against Jesper's, which I think we should definitely do
before merging this.

> Regardless, BPF in-lining is a well solved problem and used in many
> bpf helpers already, so there are many examples in the kernel. I don't
> think it is necessary to block this series because of missing some
> helper wrappers for inlining. The driver can always start with the
> simpler kfunc call first and optimize later if some hints from the
> drivers allow it.

Well, "solved" in the sense of "there are a few handfuls of core BPF
people who know how to do it". My concern is that we'll end up with
either the BPF devs having to maintain all these bits of BPF bytecode
in all the drivers, or drivers just punting to regular function calls
because the inlining is too complicated, with sub-par performance as per
the above. I don't think we should just hand-wave this away as "solved",
but rather treat it as an integral part of the initial series.

-Toke
Martin KaFai Lau Nov. 16, 2022, 6:20 p.m. UTC | #8
On 11/16/22 2:08 AM, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>>>> +                           struct bpf_patch *patch)
>>>>>>>> +{
>>>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>>>> +             /* return true; */
>>>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>>>> +     }
>>>>>>>> +}
>>> [...]
>>> }
>>>
>>> I think the nsec is what you really want.
>>>
>>> With all the drivers doing slightly different ops we would have
>>> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
>>> at least the mlx and ice drivers it looks like we would need some
>>> more BPF primitives/helpers. Looks like some more work is needed
>>> on ice driver though to get rx tstamps on all packets.
>>>
>>> Anyways this convinced me real devices will probably use BPF_CALL
>>> and not BPF insns directly.
>>
>> Some of the mlx5 path looks like this:
>>
>> #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
>>
>> static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
>>                                                 u64 timestamp)
>> {
>>           u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
>>
>>           return ns_to_ktime(time);
>> }
>>
>> If some hints are harder to get, then just doing a kfunc call is better.
> 
> Sure, but if we end up having a full function call for every field in
> the metadata, that will end up having a significant performance impact
> on the XDP data path (thinking mostly about the skb metadata case here,
> which will collect several bits of metadata).
> 
>> csum may have a better chance to inline?

It would be useful if skb_metadata could get at least one more field, like csum 
or vlan.

> 
> Yup, I agree. Including that also makes it possible to benchmark this
> series against Jesper's; which I think we should definitely be doing
> before merging this.

If a hint needs a lock to obtain, it is not cheap to begin with, kfunc or not.  
Also, there is bpf_xdp_metadata_rx_timestamp_supported() to check before doing 
the bpf_xdp_metadata_rx_timestamp().

This set gives the xdp prog a flexible way to avoid getting all hints (some of 
them require a lock) if all the xdp prog needs is a csum.  imo, giving this 
flexibility to the xdp prog is the important thing for this set in terms of 
performance.

> 
>> Regardless, BPF in-lining is a well solved problem and used in many
>> bpf helpers already, so there are many examples in the kernel. I don't
>> think it is necessary to block this series because of missing some
>> helper wrappers for inlining. The driver can always start with the
>> simpler kfunc call first and optimize later if some hints from the
>> drivers allow it.
> 
> Well, "solved" in the sense of "there are a few handfuls of core BPF
> people who know how to do it". My concern is that we'll end up with
> either the BPF devs having to maintain all these bits of BPF byte code
> in all the drivers; or drivers just punting to regular function calls
> because the inlining is too complicated, with sub-par performance as per
> the above. I don't think we should just hand-wave this away as "solved",
> but rather treat this as an integral part of the initial series.

In terms of complexity/maintainability, if a hint would need the driver to 
inline hundreds of bpf insns, that is already a good signal that it should just 
call a kfunc.  For a simple hint, xdp_metadata_export_to_skb() is a good example 
to start with, and I am not sure how much a helper wrapper can do to simplify 
things further since each driver's inline code is going to be different.

The community's review will definitely be useful for the first few drivers when 
the driver changes are posted to the list, and later drivers will have an easier 
time following, much like when xdp was initially introduced to a few drivers 
first.  Once enough drivers support it and there are things to refactor, that 
will be a better time to revisit.
John Fastabend Nov. 16, 2022, 7:03 p.m. UTC | #9
Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
> > On 11/15/22 10:38 PM, John Fastabend wrote:
> >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> >>>>>>> +                           struct bpf_patch *patch)
> >>>>>>> +{
> >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> >>>>>>> +             /* return true; */
> >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> >>>>>>> +     }
> >>>>>>> +}
> >>>>>>
> >>>>>> So these look reasonable enough, but would be good to see some examples
> >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> >>>>>> (with those helper wrappers we were discussing before).
> >>>>>
> >>>>> Let's maybe add them if/when needed as we add more metadata support?
> >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> >>>>> point?
> >>>>
> >>>> Well, the reason I asked for them is that I think having to maintain the
> >>>> BPF code generation in the drivers is probably the biggest drawback of
> >>>> the kfunc approach, so it would be good to be relatively sure that we
> >>>> can manage that complexity (via helpers) before we commit to this :)
> >>>
> >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> >>> whether that complexity is manageable or not :-)
> >>> Do you want me to add those wrappers you've back without any real users?
> >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> >>> objections; I can maybe bring some of this back gated by some
> >>> static_branch to avoid the fastpath cost?
> >> 
> >> I missed the context a bit what did you mean "would be good to see some
> >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> >> function"? In this case do you mean BPF code directly without the call?
> >> 
> >> Early on I thought we should just expose the rx_descriptor which would
> >> be roughly the same right? (difference being code embedded in driver vs
> >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> >> which wasn't as straight forward as the simpler just read it from
> >> the descriptor. For example in mlx getting the ts would be easy from
> >> BPF with the mlx4_cqe struct exposed
> >> 
> >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> >> {
> >>          u64 hi, lo;
> >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> >> 
> >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> >> 
> >>          return hi | lo;
> >> }
> >> 
> >> but converting that to nsec is a bit annoying,
> >> 
> >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> >>                              struct skb_shared_hwtstamps *hwts,
> >>                              u64 timestamp)
> >> {
> >>          unsigned int seq;
> >>          u64 nsec;
> >> 
> >>          do {
> >>                  seq = read_seqbegin(&mdev->clock_lock);
> >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> >>          } while (read_seqretry(&mdev->clock_lock, seq));
> >> 
> >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> >>          hwts->hwtstamp = ns_to_ktime(nsec);
> >> }
> >> 
> >> I think the nsec is what you really want.
> >> 
> >> With all the drivers doing slightly different ops we would have
> >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> >> at least the mlx and ice drivers it looks like we would need some
> >> more BPF primitives/helpers. Looks like some more work is needed
> >> on ice driver though to get rx tstamps on all packets.
> >> 
> >> Anyways this convinced me real devices will probably use BPF_CALL
> >> and not BPF insns directly.
> >
> > Some of the mlx5 path looks like this:
> >
> > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> >
> > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> >                                                u64 timestamp)
> > {
> >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> >
> >          return ns_to_ktime(time);
> > }
> >
> > If some hints are harder to get, then just doing a kfunc call is better.
> 
> Sure, but if we end up having a full function call for every field in
> the metadata, that will end up having a significant performance impact
> on the XDP data path (thinking mostly about the skb metadata case here,
> which will collect several bits of metadata).
> 
> > csum may have a better chance to inline?
> 
> Yup, I agree. Including that also makes it possible to benchmark this
> series against Jesper's; which I think we should definitely be doing
> before merging this.

Good point, I got sort of singularly focused on timestamp because I have
a use case for it now.

Also, the hash is often sitting in the rx descriptor.

> 
> > Regardless, BPF in-lining is a well solved problem and used in many
> > bpf helpers already, so there are many examples in the kernel. I don't
> > think it is necessary to block this series because of missing some
> > helper wrappers for inlining. The driver can always start with the
> > simpler kfunc call first and optimize later if some hints from the
> > drivers allow it.
> 
> Well, "solved" in the sense of "there are a few handfuls of core BPF
> people who know how to do it". My concern is that we'll end up with
> either the BPF devs having to maintain all these bits of BPF byte code
> in all the drivers; or drivers just punting to regular function calls
> because the inlining is too complicated, with sub-par performance as per
> the above. I don't think we should just hand-wave this away as "solved",
> but rather treat this as an integral part of the initial series.

This was my motivation for pushing the rx_descriptor into the xdp_buff.
At this point, if I'm going to have a kfunc call into the driver and
have the driver rewrite the code into some BPF instructions, I would
just as soon maintain this as library code that I can hook into my
BPF program directly from user space. Maybe a few drivers will
support all the things I want to read, but we run on lots of
hardware (mlx, intel, eks, azure, gke, etc.) and it's been a lot of work
just to get basic feature parity. I also don't want to run around
writing driver code for each vendor if I can avoid it. Having raw
access to the rx descriptor gives me the escape hatch where I can
just do it myself. And the last piece, from my point of view
(Tetragon, Cilium): I can run whatever libs I want and freely upgrade
libbpf and cilium/ebpf, but I have a lot less ability to get users
to upgrade kernels outside the LTS they picked. Meaning I can
add new things much more easily if the logic is lifted into BPF code
placed by user space.

I appreciate that it means I import the problem of hardware detection
and BTF CO-RE on networking code, but we've already solved these
problems for other reasons. For example, just configuring the timestamp
is a bit of an exercise in "does my hardware support timestamping,
and does it support timestamping the packets I care about?", e.g.
all pkts, just ptp pkts, etc.

I don't think they are mutually exclusive with this series though,
because I can't see how to write this timestamping logic directly
in BPF. But for rxhash and csum it seems doable. My preference
is to have both the kfuncs and direct exposure of the descriptor.

.John
Stanislav Fomichev Nov. 16, 2022, 8:50 p.m. UTC | #10
On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Toke Høiland-Jørgensen wrote:
> > Martin KaFai Lau <martin.lau@linux.dev> writes:
> >
> > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > >>>>>>> +                           struct bpf_patch *patch)
> > >>>>>>> +{
> > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > >>>>>>> +             /* return true; */
> > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > >>>>>>> +     }
> > >>>>>>> +}
> > >>>>>>
> > >>>>>> So these look reasonable enough, but would be good to see some examples
> > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > >>>>>> (with those helper wrappers we were discussing before).
> > >>>>>
> > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > >>>>> point?
> > >>>>
> > >>>> Well, the reason I asked for them is that I think having to maintain the
> > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > >>>> can manage that complexity (via helpers) before we commit to this :)
> > >>>
> > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > >>> whether that complexity is manageable or not :-)
> > >>> Do you want me to add those wrappers you've back without any real users?
> > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > >>> objections; I can maybe bring some of this back gated by some
> > >>> static_branch to avoid the fastpath cost?
> > >>
> > >> I missed the context a bit what did you mean "would be good to see some
> > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > >> function"? In this case do you mean BPF code directly without the call?
> > >>
> > >> Early on I thought we should just expose the rx_descriptor which would
> > >> be roughly the same right? (difference being code embedded in driver vs
> > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > >> which wasn't as straight forward as the simpler just read it from
> > >> the descriptor. For example in mlx getting the ts would be easy from
> > >> BPF with the mlx4_cqe struct exposed
> > >>
> > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > >> {
> > >>          u64 hi, lo;
> > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > >>
> > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > >>
> > >>          return hi | lo;
> > >> }
> > >>
> > >> but converting that to nsec is a bit annoying,
> > >>
> > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > >>                              struct skb_shared_hwtstamps *hwts,
> > >>                              u64 timestamp)
> > >> {
> > >>          unsigned int seq;
> > >>          u64 nsec;
> > >>
> > >>          do {
> > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > >>
> > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > >> }
> > >>
> > >> I think the nsec is what you really want.
> > >>
> > >> With all the drivers doing slightly different ops we would have
> > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > >> at least the mlx and ice drivers it looks like we would need some
> > >> more BPF primitives/helpers. Looks like some more work is needed
> > >> on ice driver though to get rx tstamps on all packets.
> > >>
> > >> Anyways this convinced me real devices will probably use BPF_CALL
> > >> and not BPF insns directly.
> > >
> > > Some of the mlx5 path looks like this:
> > >
> > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > >
> > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > >                                                u64 timestamp)
> > > {
> > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > >
> > >          return ns_to_ktime(time);
> > > }
> > >
> > > If some hints are harder to get, then just doing a kfunc call is better.
> >
> > Sure, but if we end up having a full function call for every field in
> > the metadata, that will end up having a significant performance impact
> > on the XDP data path (thinking mostly about the skb metadata case here,
> > which will collect several bits of metadata).
> >
> > > csum may have a better chance to inline?
> >
> > Yup, I agree. Including that also makes it possible to benchmark this
> > series against Jesper's; which I think we should definitely be doing
> > before merging this.
>
> Good point I got sort of singularly focused on timestamp because I have
> a use case for it now.
>
> Also hash is often sitting in the rx descriptor.

Ack, let me try to add something else (that's more inline-able) on the
rx side for a v2.

> >
> > > Regardless, BPF in-lining is a well solved problem and used in many
> > > bpf helpers already, so there are many examples in the kernel. I don't
> > > think it is necessary to block this series because of missing some
> > > helper wrappers for inlining. The driver can always start with the
> > > simpler kfunc call first and optimize later if some hints from the
> > > drivers allow it.
> >
> > Well, "solved" in the sense of "there are a few handfuls of core BPF
> > people who know how to do it". My concern is that we'll end up with
> > either the BPF devs having to maintain all these bits of BPF byte code
> > in all the drivers; or drivers just punting to regular function calls
> > because the inlining is too complicated, with sub-par performance as per
> > the above. I don't think we should just hand-wave this away as "solved",
> > but rather treat this as an integral part of the initial series.
>
> This was my motivation for pushing the rx_descriptor into the xdp_buff.
> At this point if I'm going to have a kfunc call into the driver and
> have the driver rewrite the code into some BPF instructions I would
> just assume maintain this as a library code where I can hook it
> into my BPF program directly from user space. Maybe a few drivers
> will support all the things I want to read, but we run on lots of
> hardware (mlx, intel, eks, azure, gke, etc) and its been a lot of work
> to just get the basic feature parity. I also don't want to run around
> writing driver code for each vendor if I can avoid it. Having raw
> access to the rx descriptor gets me the escape hatch where I can
> just do it myself. And the last piece of info from my point of view
> (Tetragon, Cilium) I can run whatever libs I want and freely upgrade
> libbpf and cilium/ebpf but have a lot less ability to get users
> to upgrade kernels outside the LTS they picked. Meaning I can
> add new things much easier if its lifted into BPF code placed
> by user space.
>
> I appreciate that it means I import the problem of hardware detection
> and BTF CO-RE on networking codes, but we've already solved these
> problems for other reasons. For example just configuring the timestamp
> is a bit of an exercise in does my hardware support timestamping
> and does it support timestamping the packets I care about,  e.g.
> all pkts, just ptp pkts, etc.
>
> I don't think they are mutual exclusive with this series though
> because I can't see how to write these timestamping logic directly
> in BPF. But for rxhash and csum it seems doable. My preference
> is to have both the kfuncs and expose the descriptor directly.
>
> .John
John Fastabend Nov. 16, 2022, 11:47 p.m. UTC | #11
Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Toke Høiland-Jørgensen wrote:
> > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > >
> > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > >>>>>>> +                           struct bpf_patch *patch)
> > > >>>>>>> +{
> > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > >>>>>>> +             /* return true; */
> > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > >>>>>>> +     }
> > > >>>>>>> +}
> > > >>>>>>
> > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > >>>>>> (with those helper wrappers we were discussing before).
> > > >>>>>
> > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > >>>>> point?
> > > >>>>
> > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > >>>
> > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > >>> whether that complexity is manageable or not :-)
> > > >>> Do you want me to add those wrappers you've back without any real users?
> > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > >>> objections; I can maybe bring some of this back gated by some
> > > >>> static_branch to avoid the fastpath cost?
> > > >>
> > > >> I missed the context a bit what did you mean "would be good to see some
> > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > >> function"? In this case do you mean BPF code directly without the call?
> > > >>
> > > >> Early on I thought we should just expose the rx_descriptor which would
> > > >> be roughly the same right? (difference being code embedded in driver vs
> > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > >> which wasn't as straight forward as the simpler just read it from
> > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > >> BPF with the mlx4_cqe struct exposed
> > > >>
> > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > >> {
> > > >>          u64 hi, lo;
> > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > >>
> > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > >>
> > > >>          return hi | lo;
> > > >> }
> > > >>
> > > >> but converting that to nsec is a bit annoying,
> > > >>
> > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > >>                              struct skb_shared_hwtstamps *hwts,
> > > >>                              u64 timestamp)
> > > >> {
> > > >>          unsigned int seq;
> > > >>          u64 nsec;
> > > >>
> > > >>          do {
> > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > >>
> > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > >> }
> > > >>
> > > >> I think the nsec is what you really want.
> > > >>
> > > >> With all the drivers doing slightly different ops we would have
> > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > >> at least the mlx and ice drivers it looks like we would need some
> > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > >> on ice driver though to get rx tstamps on all packets.
> > > >>
> > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > >> and not BPF insns directly.
> > > >
> > > > Some of the mlx5 path looks like this:
> > > >
> > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > >
> > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > >                                                u64 timestamp)
> > > > {
> > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > >
> > > >          return ns_to_ktime(time);
> > > > }
> > > >
> > > > If some hints are harder to get, then just doing a kfunc call is better.
> > >
> > > Sure, but if we end up having a full function call for every field in
> > > the metadata, that will end up having a significant performance impact
> > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > which will collect several bits of metadata).
> > >
> > > > csum may have a better chance to inline?
> > >
> > > Yup, I agree. Including that also makes it possible to benchmark this
> > > series against Jesper's; which I think we should definitely be doing
> > > before merging this.
> >
> > Good point I got sort of singularly focused on timestamp because I have
> > a use case for it now.
> >
> > Also hash is often sitting in the rx descriptor.
> 
> Ack, let me try to add something else (that's more inline-able) on the
> rx side for a v2.

If you go with the in-kernel BPF kfunc approach (vs the user space side), I think
you also need CO-RE to be friendly to driver developers. Otherwise
they have to keep those reads in sync with the descriptors. You also need to
handle versioning of descriptors, where the descriptor layout might move
around depending on the specific options, firmware, and chip enabled.
Of course we can push this all onto the developer, but that seems not so
nice when we have the machinery to do this and we handle it for all
other structures.

With CO-RE you can simply do rx_desc->hash and rx_desc->csum and
expect CO-RE to sort it out based on the currently running btf_id of the
descriptor. If you go through the normal path, you get this for free of
course.

.John
Stanislav Fomichev Nov. 17, 2022, 12:19 a.m. UTC | #12
On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Toke Høiland-Jørgensen wrote:
> > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > >
> > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > >>>>>>> +{
> > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > >>>>>>> +             /* return true; */
> > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > >>>>>>> +     }
> > > > >>>>>>> +}
> > > > >>>>>>
> > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > >>>>>
> > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > >>>>> point?
> > > > >>>>
> > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > >>>
> > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > >>> whether that complexity is manageable or not :-)
> > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > >>> objections; I can maybe bring some of this back gated by some
> > > > >>> static_branch to avoid the fastpath cost?
> > > > >>
> > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > >>
> > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > >> which wasn't as straight forward as the simpler just read it from
> > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > >> BPF with the mlx4_cqe struct exposed
> > > > >>
> > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > >> {
> > > > >>          u64 hi, lo;
> > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > >>
> > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > >>
> > > > >>          return hi | lo;
> > > > >> }
> > > > >>
> > > > >> but converting that to nsec is a bit annoying,
> > > > >>
> > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > >>                              u64 timestamp)
> > > > >> {
> > > > >>          unsigned int seq;
> > > > >>          u64 nsec;
> > > > >>
> > > > >>          do {
> > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > >>
> > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > >> }
> > > > >>
> > > > >> I think the nsec is what you really want.
> > > > >>
> > > > >> With all the drivers doing slightly different ops, we would have
> > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... wrappers; to
> > > > >> cover even just the mlx and ice drivers, it looks like we would need
> > > > >> some more BPF primitives/helpers. It also looks like some more work is
> > > > >> needed on the ice driver to get rx tstamps on all packets.
> > > > >>
> > > > >> Anyway, this convinced me real devices will probably use BPF_CALL
> > > > >> and not BPF insns directly.
> > > > >
> > > > > Some of the mlx5 path looks like this:
> > > > >
> > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > >
> > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > >                                                u64 timestamp)
> > > > > {
> > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > >
> > > > >          return ns_to_ktime(time);
> > > > > }
> > > > >
> > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > >
> > > > Sure, but if we end up having a full function call for every field in
> > > > the metadata, that will end up having a significant performance impact
> > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > which will collect several bits of metadata).
> > > >
> > > > > csum may have a better chance to inline?
> > > >
> > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > series against Jesper's; which I think we should definitely be doing
> > > > before merging this.
> > >
> > > Good point; I got sort of singularly focused on timestamp because I have
> > > a use case for it now.
> > >
> > > Also hash is often sitting in the rx descriptor.
> >
> > Ack, let me try to add something else (that's more inline-able) on the
> > rx side for a v2.
>
> If you go with the in-kernel BPF kfunc approach (vs. the user-space side), I think
> you also need to add CO-RE to be friendly to driver developers. Otherwise
> they have to keep that read in sync with the descriptors. They also need to
> handle versioning of descriptors, where, depending on which specific options,
> firmware, and chip are enabled, the fields might move around within the
> descriptor. Of course we can push this all onto the developer, but that seems
> not so nice when we have the machinery to do this and we handle it for all
> other structures.
>
> With CO-RE you can simply do the rx_desc->hash and rx_desc->csum reads and
> expect CO-RE to sort them out based on the currently running btf_id of the
> descriptor. If you go through the normal path you get this for free, of
> course.

Doesn't look like the descriptors are as nice as you're trying to
paint them (with clear hash/csum fields) :-) So I'm not sure how much
CO-RE would help.
At least looking at mlx4's rx_csum, the driver consults three different
sets of flags to figure out the hash_type. Or am I just unlucky with
mlx4?
Alexei Starovoitov Nov. 17, 2022, 2:17 a.m. UTC | #13
On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> [...]
>
> Doesn't look like the descriptors are as nice as you're trying to
> paint them (with clear hash/csum fields) :-) So not sure how much
> CO-RE would help.
> At least looking at mlx4 rx_csum, the driver consults three different
> sets of flags to figure out the hash_type. Or am I just unlucky with
> mlx4?

Which part are you talking about?
        hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
is trivial enough for a bpf prog to do if it has access to the 'cqe' pointer,
which is what John is proposing (I think).
Stanislav Fomichev Nov. 17, 2022, 2:53 a.m. UTC | #14
On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
> >
> > Doesn't look like the descriptors are as nice as you're trying to
> > paint them (with clear hash/csum fields) :-) So not sure how much
> > CO-RE would help.
> > At least looking at mlx4 rx_csum, the driver consults three different
> > sets of flags to figure out the hash_type. Or am I just unlucky with
> > mlx4?
>
> Which part are you talking about ?
>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> which is what John is proposing (I think).

I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
In particular, the if (likely(dev->features & NETIF_F_RXCSUM)) branch.
I'm assuming we want to have hash_type available to the progs?

But also, check_csum handles other corner cases:
- short_frame: we simply force all those small frames to skip checksum complete
- get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
IPv6 header
- get_fixed_ipv4_csum: Although the stack expects checksum which
doesn't include the pseudo header, the HW adds it

So it doesn't look like we can just unconditionally use cqe->checksum?
The driver does a lot of massaging around that field to make it
palatable.
Alexei Starovoitov Nov. 17, 2022, 2:59 a.m. UTC | #15
On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > [...]
> >
> > Which part are you talking about ?
> >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > which is what John is proposing (I think).
>
> I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> I'm assuming we want to have hash_type available to the progs?
>
> But also, check_csum handles other corner cases:
> - short_frame: we simply force all those small frames to skip checksum complete
> - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> IPv6 header
> - get_fixed_ipv4_csum: Although the stack expects checksum which
> doesn't include the pseudo header, the HW adds it

Of course, but kfunc won't be doing them either.
We're talking XDP fast path.
The mlx4 hw is old and incapable.
No amount of sw can help.

> So it doesn't look like we can just unconditionally use cqe->checksum?
> The driver does a lot of massaging around that field to make it
> palatable.

Of course we can. cqe->checksum is still usable; the bpf prog
would need to know what it's reading.
There should be no attempt to present a unified state of the hw bits.
That's what the skb is for. The XDP layer should not hide such hw details.
Otherwise it will become a mini skb layer, with all that overhead.
Stanislav Fomichev Nov. 17, 2022, 4:18 a.m. UTC | #16
On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Stanislav Fomichev wrote:
> > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > >
> > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > >
> > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > >>>>>>> +{
> > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > >>>>>>> +     }
> > > > > > > > >>>>>>> +}
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > >>>>>
> > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > >>>>> point?
> > > > > > > > >>>>
> > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > >>>
> > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > >>
> > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > >>
> > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > >>
> > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > >> {
> > > > > > > > >>          u64 hi, lo;
> > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > >>
> > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > >>
> > > > > > > > >>          return hi | lo;
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > >>
> > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > >>                              u64 timestamp)
> > > > > > > > >> {
> > > > > > > > >>          unsigned int seq;
> > > > > > > > >>          u64 nsec;
> > > > > > > > >>
> > > > > > > > >>          do {
> > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > >>
> > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > >>
> > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > >>
> > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > >> and not BPF insns directly.
> > > > > > > > >
> > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > >
> > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > >
> > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > >                                                u64 timestamp)
> > > > > > > > > {
> > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > >
> > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > >
> > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > which will collect several bits of metadata).
> > > > > > > >
> > > > > > > > > csum may have a better chance to inline?
> > > > > > > >
> > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > before merging this.
> > > > > > >
> > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > a use case for it now.
> > > > > > >
> > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > >
> > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > rx side for a v2.
> > > > >
> > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > handle versioning of descriptors where depending on specific options
> > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > around. Of course can push this all to developer, but seems not so
> > > > > nice when we have the machinery to do this and we handle it for all
> > > > > other structures.
> > > > >
> > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > descriptor. If you go through normal path you get this for free of
> > > > > course.
> > > >
> > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > CO-RE would help.
> > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > mlx4?
> > >
> > > Which part are you talking about ?
> > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > which is what John is proposing (I think).
> >
> > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > I'm assuming we want to have hash_type available to the progs?
> >
> > But also, check_csum handles other corner cases:
> > - short_frame: we simply force all those small frames to skip checksum complete
> > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > IPv6 header
> > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > doesn't include the pseudo header, the HW adds it
>
> Of course, but kfunc won't be doing them either.
> We're talking XDP fast path.
> The mlx4 hw is old and incapable.
> No amount of sw can help.
>
> > So it doesn't look like we can just unconditionally use cqe->checksum?
> > The driver does a lot of massaging around that field to make it
> > palatable.
>
> Of course we can. cqe->checksum is still usable. the bpf prog
> would need to know what it's reading.
> There should be no attempt to present a unified state of hw bits.
> That's what skb is for. XDP layer should not hide such hw details.
> Otherwise it will become a mini skb layer with all that overhead.

I was hoping the kfunc could at least parse the flags and return some
pkt_hash_types-like enum to indicate what this csum covers.
So the users won't have to find the hardware manuals (not sure they
are even available?) to decipher what numbers they've got.
Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
that much different :-(

But going back a bit: I'm probably missing what John has been
suggesting. How is CO-RE relevant for kfuncs? kfuncs already provide
CO-RE-like functionality by rewriting some "public api" (kfunc) into
the bytecode that accesses the relevant data.
John Fastabend Nov. 17, 2022, 6:55 a.m. UTC | #17
Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > handle versioning of descriptors where depending on specific options
> > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > other structures.
> > > > > >
> > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > course.
> > > > >
> > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > CO-RE would help.
> > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > mlx4?
> > > >
> > > > Which part are you talking about ?
> > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > which is what John is proposing (I think).

Yeah this is what I've been considering. If you just get the 'cqe'
pointer, walking the check_sum and rxhash should be straightforward.

There seems to be a real difference between timestamps and most other
fields IMO. Timestamps require some extra logic to turn into ns when
using the NIC hw clock, and the hw clock requires some coordination to
keep it in sync and stop it from overflowing; that may be done through
other protocols like PTP in my use case. In some cases I think the clock
is also shared amongst multiple phys. mlx has a seqlock_t to protect it,
and, though I'm less sure about this, the Intel NICs seem to have a
sideband control channel.

Then there is everything else I can think of that is per-packet data
and requires no coordination with the driver. It's just reading fields
in the completion queue: the csum, check_sum, vlan_header and so on.
Sure, we could kfunc each one of those things, but we could also just
write that directly in BPF and remove some abstractions and kernel
dependency by doing it directly in the BPF program. Whether you like
that abstraction seems to be the point of contention; my opinion is that
the cost of the kernel dependency is high, and I can abstract it with a
user library anyway, so burying it in the kernel only causes me support
issues and backwards-compat problems.

Hopefully, my position is clearer now.

> > >
> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > I'm assuming we want to have hash_type available to the progs?
> > >
> > > But also, check_csum handles other corner cases:
> > > - short_frame: we simply force all those small frames to skip checksum complete
> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > IPv6 header
> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > doesn't include the pseudo header, the HW adds it
> >
> > Of course, but kfunc won't be doing them either.
> > We're talking XDP fast path.
> > The mlx4 hw is old and incapable.
> > No amount of sw can help.

Doesn't this lend itself to letting the XDP BPF program write the
BPF code to read it out? Maybe someone cares about these details
for some cpumap thing, but the rest of us won't care; we might just
want to read check_csum. Maybe we have an IPv6-only network or an
IPv4 network and so can make further shortcuts. If a driver dev does
this they will be forced to do the catch-all version because
they have no way to know such details.

> >
> > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > The driver does a lot of massaging around that field to make it
> > > palatable.
> >
> > Of course we can. cqe->checksum is still usable. the bpf prog
> > would need to know what it's reading.
> > There should be no attempt to present a unified state of hw bits.
> > That's what skb is for. XDP layer should not hide such hw details.
> > Otherwise it will become a mini skb layer with all that overhead.
> 
> I was hoping the kfunc could at least parse the flags and return some
> pkt_hash_types-like enum to indicate what this csum covers.
> So the users won't have to find the hardware manuals (not sure they
> are even available?) to decipher what numbers they've got.
> Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
> that much different :-(

The driver developers could still provide and ship the BPF libs
with their drivers. I think if someone is going to use their NIC,
and lots of them, and requires XDP, it will get done. We could put
them next to the driver code, mlx4.bpf or something.

> 
> But going back a bit: I'm probably missing what John has been
> suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
> a CO-RE-like functionality by rewriting some "public api" (kfunc) into
> the bytecode to access the relevant data.

This was maybe a bit of an aside. What I was pondering out loud
is that, as I recall, there may be a few different descriptor
layouts depending on the features enabled, the exact device loaded,
and such. So in this case, if I were a driver writer, I might not want
to hardcode the offset of the check_sum field. If I could use CO-RE,
then I wouldn't have to care whether in one version it is the Nth field
and later on someone makes it the Mth field, just like any normal kernel
struct. But through the kfunc interface I couldn't see how to get that.
So instead of having a bunch of kfunc implementations you could just
have one for all your device classes, because you always name the
field the same thing.
Toke Høiland-Jørgensen Nov. 17, 2022, 10:27 a.m. UTC | #18
Just a separate comment on this bit:

John Fastabend <john.fastabend@gmail.com> writes:
> If you go with in-kernel BPF kfunc approach (vs user space side) I think
> you also need to add CO-RE to be friendly for driver developers? Otherwise
> they have to keep that read in sync with the descriptors?

CO-RE is for doing relocations of field offsets without having to
recompile. That's not really relevant for the kernel, that gets
recompiled whenever the layout changes. So the field offsets are just
kept in sync with offsetof(), like in Stanislav's RFCv2 where he had
this snippet:

+			/*	return ((struct sk_buff *)r5)->tstamp; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5,
+				    offsetof(struct sk_buff, tstamp)),

So I definitely don't think this is an argument against the kfunc
approach?

> Also need to handle versioning of descriptors where depending on
> specific options and firmware and chip being enabled the descriptor
> might be moving around.

This is exactly the kind of thing the driver is supposed to take care
of; it knows the hardware configuration and can pick the right
descriptor format. Either by just picking an entirely different kfunc
unroll depending on the config (if it's static), or by adding the right
checks to the unroll. You'd have to replicate all this logic in BPF
anyway, and while I'm not doubting *you* are capable of doing this, I
don't think we should be forcing every XDP developer to deal with all
this.

Or to put it another way, a proper hardware abstraction and high-quality
drivers is one of the main selling points of XDP over DPDK and other
kernel bypass solutions; we should not jettison this when enabling
metadata support!

-Toke
Toke Høiland-Jørgensen Nov. 17, 2022, 11:32 a.m. UTC | #19
Stanislav Fomichev <sdf@google.com> writes:

>> > [...]
>>
>> Which part are you talking about ?
>>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
>> which is what John is proposing (I think).
>
> I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> I'm assuming we want to have hash_type available to the progs?

I agree we should expose the hash_type, but that doesn't actually look
to be that complicated, see below.

> But also, check_csum handles other corner cases:
> - short_frame: we simply force all those small frames to skip checksum complete
> - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> IPv6 header
> - get_fixed_ipv4_csum: Although the stack expects checksum which
> doesn't include the pseudo header, the HW adds it
>
> So it doesn't look like we can just unconditionally use cqe->checksum?
> The driver does a lot of massaging around that field to make it
> palatable.

Poking around a bit in the other drivers, AFAICT it's only a subset of
drivers that support CSUM_COMPLETE at all; for instance, the Intel
drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
propagate?

AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
to other data structures than the rx descriptor to determine the status
of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
just exposing the rx descriptor to BPF as John is suggesting does not
actually give the XDP program enough information to act on the checksum
field on its own. We could still have a separate kfunc to just expose
the hw checksum value (see below), but I think it probably needs to be
paired with other kfuncs to be useful.

Looking at the mlx4 code, I think the following mapping to kfuncs (in
pseudo-C) would give the flexibility for XDP to access all the bits it
needs, while inlining everything except getting the full checksum for
non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
drivers (mlx5, ice, i40e) indicates that this would work for those
drivers as well.


bpf_xdp_metadata_rx_hash_supported() {
  return dev->features & NETIF_F_RXHASH;
}

bpf_xdp_metadata_rx_hash() {
  return be32_to_cpu(cqe->immed_rss_invalid);
}

bpf_xdp_metadata_rx_hash_type() {
  if (likely(dev->features & NETIF_F_RXCSUM) &&
      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
      cqe->checksum == cpu_to_be16(0xffff))
    return PKT_HASH_TYPE_L4;

  return PKT_HASH_TYPE_L3;
}

bpf_xdp_metadata_rx_csum_supported() {
  return dev->features & NETIF_F_RXCSUM;
}

bpf_xdp_metadata_rx_csum_level() {
  if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
                                 MLX4_CQE_STATUS_UDP)) &&
      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
      cqe->checksum == cpu_to_be16(0xffff))
    return CHECKSUM_UNNECESSARY;

  if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
        (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
      !short_frame(len))
    return CHECKSUM_COMPLETE; /* we could also omit this case entirely */

  return CHECKSUM_NONE;
}

/* this one could be called by the metadata_to_skb code */
bpf_xdp_metadata_rx_csum_full() {
  return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
}

/* this one would be for people like John who want to re-implement
 * check_csum() themselves */
bpf_xdp_metadata_rx_csum_raw() {
  return cqe->checksum;
}


-Toke
Alexei Starovoitov Nov. 17, 2022, 4:59 p.m. UTC | #20
On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> >> > [...]
> >>
> >> Which part are you talking about ?
> >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> >> which is what John is proposing (I think).
> >
> > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > I'm assuming we want to have hash_type available to the progs?
>
> I agree we should expose the hash_type, but that doesn't actually look
> to be that complicated, see below.
>
> > But also, check_csum handles other corner cases:
> > - short_frame: we simply force all those small frames to skip checksum complete
> > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > IPv6 header
> > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > doesn't include the pseudo header, the HW adds it
> >
> > So it doesn't look like we can just unconditionally use cqe->checksum?
> > The driver does a lot of massaging around that field to make it
> > palatable.
>
> Poking around a bit in the other drivers, AFAICT it's only a subset of
> drivers that support CSUM_COMPLETE at all; for instance, the Intel
> drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
> CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
> propagate?
>
> AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
> to other data structures than the rx descriptor to determine the status
> of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
> just exposing the rx descriptor to BPF as John is suggesting does not
> actually give the XDP program enough information to act on the checksum
> field on its own. We could still have a separate kfunc to just expose
> the hw checksum value (see below), but I think it probably needs to be
> paired with other kfuncs to be useful.
>
> Looking at the mlx4 code, I think the following mapping to kfuncs (in
> pseudo-C) would give the flexibility for XDP to access all the bits it
> needs, while inlining everything except getting the full checksum for
> non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
> drivers (mlx5, ice, i40e) indicates that this would work for those
> drivers as well.
>
>
> bpf_xdp_metadata_rx_hash_supported() {
>   return dev->features & NETIF_F_RXHASH;
> }
>
> bpf_xdp_metadata_rx_hash() {
>   return be32_to_cpu(cqe->immed_rss_invalid);
> }
>
> bpf_xdp_metadata_rx_hash_type() {
>   if (likely(dev->features & NETIF_F_RXCSUM) &&
>       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
>       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>       cqe->checksum == cpu_to_be16(0xffff))
>     return PKT_HASH_TYPE_L4;
>
>   return PKT_HASH_TYPE_L3;
> }
>
> bpf_xdp_metadata_rx_csum_supported() {
>   return dev->features & NETIF_F_RXCSUM;
> }
>
> bpf_xdp_metadata_rx_csum_level() {
>         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>                                        MLX4_CQE_STATUS_UDP)) &&
>             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>             cqe->checksum == cpu_to_be16(0xffff))
>             return CHECKSUM_UNNECESSARY;
>
>         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
>               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
>               !short_frame(len))
>             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
>
>         return CHECKSUM_NONE;
> }
>
> /* this one could be called by the metadata_to_skb code */
> bpf_xdp_metadata_rx_csum_full() {
>   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
> }
>
> /* this one would be for people like John who want to re-implement
>  * check_csum() themselves */
> bpf_xdp_metadata_rx_csum_raw() {
>   return cqe->checksum;
> }

Are you proposing a bunch of per-driver kfuncs that the bpf prog will call?
If so, that works, but the bpf prog needs to pass dev and cqe pointers
into these kfuncs, so they need to be exposed to the prog somehow.
Probably through xdp_md?
This way we can have both: bpf prog reading cqe fields directly
and using kfuncs to access things.
Inlining of kfuncs should be done generically.
It's not the driver's job to convert native asm into bpf asm.
Stanislav Fomichev Nov. 17, 2022, 5:51 p.m. UTC | #21
On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > > >
> > > > > > > Stanislav Fomichev wrote:
> > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > > > >
> > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > > >>>>>>> +{
> > > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > >>>>>>> +     }
> > > > > > > > > > >>>>>>> +}
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > >>>>> point?
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > >>>
> > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > >>
> > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > >>
> > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > >>
> > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > >> {
> > > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > >>
> > > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > >>
> > > > > > > > > > >>          return hi | lo;
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > >>
> > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > > >> {
> > > > > > > > > > >>          unsigned int seq;
> > > > > > > > > > >>          u64 nsec;
> > > > > > > > > > >>
> > > > > > > > > > >>          do {
> > > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > >>
> > > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > >>
> > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > >>
> > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > >
> > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > >
> > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > >
> > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > > {
> > > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > >
> > > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > >
> > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > >
> > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > >
> > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > before merging this.
> > > > > > > > >
> > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > a use case for it now.
> > > > > > > > >
> > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > >
> > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > rx side for a v2.
> > > > > > >
> > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > > handle versioning of descriptors where depending on specific options
> > > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > > other structures.
> > > > > > >
> > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > course.
> > > > > >
> > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > CO-RE would help.
> > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > mlx4?
> > > > >
> > > > > Which part are you talking about ?
> > > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > which is what John is proposing (I think).
>
> Yeah this is what I've been considering. If you just get the 'cqe' pointer,
> walking the check_sum and rxhash should be straightforward.
>
> There seems to be a real difference between timestamps and most other
> fields IMO. Timestamps require some extra logic to turn into ns when
> using the NIC hw clock. And the hw clock requires some coordination to
> keep it in sync and stop it from overflowing, which may be done through other
> protocols like PTP in my use case. In some cases I think the clock is
> also shared amongst multiple phys. It seems mlx has a seqlock_t to protect
> it and, though I'm less sure about this, it seems the intel nic may have a sideband
> control channel.
>
> Then there is everything else that I can think up that is per-packet data
> and requires no coordination with the driver. It's just reading fields in
> the completion queue. This would be the csum, check_sum, vlan_header and
> so on. Sure, we could kfunc each one of those things, but we could also just
> write that directly in BPF and remove some abstractions and kernel
> dependency by doing it directly in the BPF program. If you like that
> abstraction, which seems to be the point of contention, my opinion is that the
> cost of the kernel dependency is high, and I can abstract it with a user
> library anyway, so burying it in the kernel only causes me support issues and
> backwards-compat problems.
>
> Hopefully, my position is more clear.

Yeah, I see your point; I'm somewhat in the same position here wrt
legacy kernels :-)
Exposing raw descriptors seems fine, but imo it shouldn't be the go-to
mechanism for the metadata, but rather a fallback for whenever the
driver implementation is missing/buggy. Unless, as you mention below,
there are some libraries in the future to abstract that.
But at least it seems that we agree that there needs to be some other
non-raw-descriptor way to access spinlocked things like the timestamp?


> > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > > I'm assuming we want to have hash_type available to the progs?
> > > >
> > > > But also, check_csum handles other corner cases:
> > > > - short_frame: we simply force all those small frames to skip checksum complete
> > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > > IPv6 header
> > > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > > doesn't include the pseudo header, the HW adds it
> > >
> > > Of course, but kfunc won't be doing them either.
> > > We're talking XDP fast path.
> > > The mlx4 hw is old and incapable.
> > > No amount of sw can help.
>
> Doesn't this lend itself to letting the XDP BPF program write the
> BPF code to read it out? Maybe someone cares about these details
> for some cpumap thing, but the rest of us won't care; we might just
> want to read check_sum. Maybe we have an IPv6-only network or an
> IPv4 network and so can make further shortcuts. If a driver dev does
> this, they will be forced to do the catch-all version because
> they have no way to know such details.
>
> > > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > > The driver does a lot of massaging around that field to make it
> > > > palatable.
> > >
> > > Of course we can. cqe->checksum is still usable. the bpf prog
> > > would need to know what it's reading.
> > > There should be no attempt to present a unified state of hw bits.
> > > That's what skb is for. XDP layer should not hide such hw details.
> > > Otherwise it will become a mini skb layer with all that overhead.
> >
> > I was hoping the kfunc could at least parse the flags and return some
> > pkt_hash_types-like enum to indicate what this csum covers.
> > So the users won't have to find the hardware manuals (not sure they
> > are even available?) to decipher what numbers they've got.
> > Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
> > that much different :-(
>
> The driver developers could still provide and ship the BPF libs
> with their drivers. I think if someone is going to use their NIC,
> and lots of them, and requires XDP, it will get done. We could put
> them next to the driver code, mlx4.bpf or something.
>
> >
> > But going back a bit: I'm probably missing what John has been
> > suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
> > a CO-RE-like functionality by rewriting some "public api" (kfunc) into
> > the bytecode to access the relevant data.
>
> This was maybe a bit of an aside. What I was pondering out
> loud is that, to my recollection, there may be a few different
> descriptor layouts depending on the features enabled, the exact device
> loaded, and such. So in this case, if I were a driver writer, I might not want
> to hardcode the offset of the check_sum field. If I could use CO-RE,
> then I wouldn't have to care whether in one version it is the Nth field and
> later on someone makes it the Mth field, just like any normal kernel struct.
> But through the kfunc interface I couldn't see how to get that.
> So instead of having a bunch of kfunc implementations you could just
> have one for all your device classes because you always name the
> field the same thing.
Stanislav Fomichev Nov. 17, 2022, 5:52 p.m. UTC | #22
On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > >> > Doesn't look like the descriptors are as nice as you're trying to
> > >> > paint them (with clear hash/csum fields) :-) So not sure how much
> > >> > CO-RE would help.
> > >> > At least looking at mlx4 rx_csum, the driver consults three different
> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with
> > >> > mlx4?
> > >>
> > >> Which part are you talking about ?
> > >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > >> which is what John is proposing (I think).
> > >
> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > I'm assuming we want to have hash_type available to the progs?
> >
> > I agree we should expose the hash_type, but that doesn't actually look
> > to be that complicated, see below.
> >
> > > But also, check_csum handles other corner cases:
> > > - short_frame: we simply force all those small frames to skip checksum complete
> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > IPv6 header
> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > doesn't include the pseudo header, the HW adds it
> > >
> > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > The driver does a lot of massaging around that field to make it
> > > palatable.
> >
> > Poking around a bit in the other drivers, AFAICT it's only a subset of
> > drivers that support CSUM_COMPLETE at all; for instance, the Intel
> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
> > propagate?
> >
> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
> > to other data structures than the rx descriptor to determine the status
> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
> > just exposing the rx descriptor to BPF as John is suggesting does not
> > actually give the XDP program enough information to act on the checksum
> > field on its own. We could still have a separate kfunc to just expose
> > the hw checksum value (see below), but I think it probably needs to be
> > paired with other kfuncs to be useful.
> >
> > Looking at the mlx4 code, I think the following mapping to kfuncs (in
> > pseudo-C) would give the flexibility for XDP to access all the bits it
> > needs, while inlining everything except getting the full checksum for
> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
> > drivers (mlx5, ice, i40e) indicates that this would work for those
> > drivers as well.
> >
> >
> > bpf_xdp_metadata_rx_hash_supported() {
> >   return dev->features & NETIF_F_RXHASH;
> > }
> >
> > bpf_xdp_metadata_rx_hash() {
> >   return be32_to_cpu(cqe->immed_rss_invalid);
> > }
> >
> > bpf_xdp_metadata_rx_hash_type() {
> >   if (likely(dev->features & NETIF_F_RXCSUM) &&
> >       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
> >       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> >       cqe->checksum == cpu_to_be16(0xffff))
> >     return PKT_HASH_TYPE_L4;
> >
> >   return PKT_HASH_TYPE_L3;
> > }
> >
> > bpf_xdp_metadata_rx_csum_supported() {
> >   return dev->features & NETIF_F_RXCSUM;
> > }
> >
> > bpf_xdp_metadata_rx_csum_level() {
> >         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> >                                        MLX4_CQE_STATUS_UDP)) &&
> >             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> >             cqe->checksum == cpu_to_be16(0xffff))
> >             return CHECKSUM_UNNECESSARY;
> >
> >         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
> >               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
> >               !short_frame(len))
> >             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
> >
> >         return CHECKSUM_NONE;
> > }
> >
> > /* this one could be called by the metadata_to_skb code */
> > bpf_xdp_metadata_rx_csum_full() {
> >   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
> > }
> >
> > /* this one would be for people like John who want to re-implement
> >  * check_csum() themselves */
> > bpf_xdp_metadata_rx_csum_raw() {
> >   return cqe->checksum;
> > }
>
> Are you proposing a bunch of per-driver kfuncs that the bpf prog will call?
> If so that works, but bpf prog needs to pass dev and cqe pointers
> into these kfuncs, so they need to be exposed to the prog somehow.
> Probably through xdp_md ?

So far I'm doing:

struct mlx4_xdp_buff {
  struct xdp_buff xdp;
  struct mlx4_cqe *cqe;
  struct mlx4_en_dev *mdev;
};

And then the kfuncs get the ctx (aka xdp_buff) as the sole argument and can
find cqe/mdev via container_of.

If we really need these to be exposed to the program, can we use
Yonghong's approach from [0]?

0: https://lore.kernel.org/bpf/20221114162328.622665-1-yhs@fb.com/

> This way we can have both: bpf prog reading cqe fields directly
> and using kfuncs to access things.
> Inlining of kfuncs should be done generically.
> It's not the driver's job to convert native asm into bpf asm.

Ack. I can replace the unrolling with something that just resolves
"generic" kfuncs to the per-driver implementation, maybe? That would at
least avoid netdev->ndo_kfunc_xxx indirect calls at runtime.
John Fastabend Nov. 17, 2022, 7:47 p.m. UTC | #23
Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Stanislav Fomichev wrote:
> > > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Stanislav Fomichev wrote:
> > > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > > > > >
> > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > > > >>>>>>> +{
> > > > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > > >>>>>>> +     }
> > > > > > > > > > > >>>>>>> +}
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > > >>>>> point?
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > > >>
> > > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > > >>
> > > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > > >>
> > > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > > >> {
> > > > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          return hi | lo;
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > > >>
> > > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > > > >> {
> > > > > > > > > > > >>          unsigned int seq;
> > > > > > > > > > > >>          u64 nsec;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          do {
> > > > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > > >>
> > > > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > > >>
> > > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > > >
> > > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > > >
> > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > > >
> > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > > > {
> > > > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > > >
> > > > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > > >
> > > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > > >
> > > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > > >
> > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > > before merging this.
> > > > > > > > > >
> > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > > a use case for it now.
> > > > > > > > > >
> > > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > > >
> > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > > rx side for a v2.
> > > > > > > >
> > > > > > > > If you go with the in-kernel BPF kfunc approach (vs the user-space side) I
> > > > > > > > think you also need to add CO-RE to be friendly to driver developers;
> > > > > > > > otherwise they have to keep those reads in sync with the descriptors. You
> > > > > > > > also need to handle versioning of descriptors, where, depending on the
> > > > > > > > specific options, firmware, and chip enabled, the descriptor might move
> > > > > > > > around. Of course we can push all this to the developer, but that seems
> > > > > > > > not so nice when we have the machinery to do this and we handle it for
> > > > > > > > all other structures.
> > > > > > > >
> > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > > course.
> > > > > > >
> > > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > > CO-RE would help.
> > > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > > mlx4?
> > > > > >
> > > > > > Which part are you talking about ?
> > > > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > > which is what John is proposing (I think).
> >
> > Yeah this is what I've been considering. If you just get the 'cqe'
> > pointer, walking the check_sum and rxhash should be straightforward.
> >
> > There seems to be a real difference between timestamps and most other
> > fields IMO. Timestamps require some extra logic to turn into ns when
> > using the NIC hw clock, and the hw clock requires some coordination to
> > keep it in sync and stop it from overflowing, which may be done through
> > other protocols like PTP in my use case. In some cases I think the clock
> > is also shared amongst multiple phys. It seems mlx has a seqlock_t to
> > protect it, and (I'm less sure about this) the Intel NIC may have a
> > sideband control channel.
> >
> > Then there is everything else that I can think up that is per-packet data
> > and requires no coordination with the driver. It's just reading fields in
> > the completion queue. This would be the csum, check_sum, vlan_header and
> > so on. Sure, we could kfunc each one of those things, but we could also
> > just write that directly in BPF and remove some abstractions and kernel
> > dependency by doing it directly in the BPF program. Whether you like that
> > abstraction seems to be the point of contention; my opinion is that the
> > cost of a kernel dependency is high, and I can abstract it with a user
> > library anyways, so burying it in the kernel only causes me support
> > issues and backwards-compat problems.
> >
> > Hopefully, my position is more clear.
> 
> Yeah, I see your point, I'm somewhat in the same position here wrt to
> legacy kernels :-)
> Exposing raw descriptors seems fine, but imo it shouldn't be the go-to
> mechanism for the metadata; rather, it should be a fallback for whenever
> the driver implementation is missing/buggy. Unless, as you mention below,
> there are some libraries in the future to abstract that.
> But at least it seems that we agree that there needs to be some other
> non-raw-descriptor way to access spinlocked things like the timestamp?
> 

Yeah, for timestamps I think either a kfunc to get the timestamp or a
kfunc to read the hw clock would work. Either way it seems hard to do
that in BPF code directly, so a kfunc feels right to me here.

By the way, I think if you have the completion queue (rx descriptor) in
the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type,
then we should be able to also directly read all the fields. I see you
noted this in the response to Alexei, so let's see what he thinks.
Alexei Starovoitov Nov. 17, 2022, 8:17 p.m. UTC | #24
On Thu, Nov 17, 2022 at 11:47 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Yeah, for timestamps I think either a kfunc to get the timestamp or a
> kfunc to read the hw clock would work. Either way it seems hard to do
> that in BPF code directly, so a kfunc feels right to me here.
>
> By the way, I think if you have the completion queue (rx descriptor) in
> the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type,
> then we should be able to also directly read all the fields. I see you
> noted this in the response to Alexei, so let's see what he thinks.

Fine with me.
Let's land something that is not uapi and then iterate on top.
Toke Høiland-Jørgensen Nov. 17, 2022, 11:46 p.m. UTC | #25
Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> > >> > Doesn't look like the descriptors are as nice as you're trying to
>> > >> > paint them (with clear hash/csum fields) :-) So not sure how much
>> > >> > CO-RE would help.
>> > >> > At least looking at mlx4 rx_csum, the driver consults three different
>> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with
>> > >> > mlx4?
>> > >>
>> > >> Which part are you talking about ?
>> > >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
>> > >> which is what John is proposing (I think).
>> > >
>> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
>> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
>> > > I'm assuming we want to have hash_type available to the progs?
>> >
>> > I agree we should expose the hash_type, but that doesn't actually look
>> > to be that complicated, see below.
>> >
>> > > But also, check_csum handles other corner cases:
>> > > - short_frame: we simply force all those small frames to skip checksum complete
>> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
>> > > IPv6 header
>> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
>> > > doesn't include the pseudo header, the HW adds it
>> > >
>> > > So it doesn't look like we can just unconditionally use cqe->checksum?
>> > > The driver does a lot of massaging around that field to make it
>> > > palatable.
>> >
>> > Poking around a bit in the other drivers, AFAICT it's only a subset of
>> > drivers that support CSUM_COMPLETE at all; for instance, the Intel
>> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
>> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
>> > propagate?
>> >
>> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
>> > to other data structures than the rx descriptor to determine the status
>> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
>> > just exposing the rx descriptor to BPF as John is suggesting does not
>> > actually give the XDP program enough information to act on the checksum
>> > field on its own. We could still have a separate kfunc to just expose
>> > the hw checksum value (see below), but I think it probably needs to be
>> > paired with other kfuncs to be useful.
>> >
>> > Looking at the mlx4 code, I think the following mapping to kfuncs (in
>> > pseudo-C) would give the flexibility for XDP to access all the bits it
>> > needs, while inlining everything except getting the full checksum for
>> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
>> > drivers (mlx5, ice, i40e) indicates that this would work for those
>> > drivers as well.
>> >
>> >
>> > bpf_xdp_metadata_rx_hash_supported() {
>> >   return dev->features & NETIF_F_RXHASH;
>> > }
>> >
>> > bpf_xdp_metadata_rx_hash() {
>> >   return be32_to_cpu(cqe->immed_rss_invalid);
>> > }
>> >
>> > bpf_xdp_metadata_rx_hash_type() {
>> >   if (likely(dev->features & NETIF_F_RXCSUM) &&
>> >       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
>> >         (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>> >           cqe->checksum == cpu_to_be16(0xffff))
>> >      return PKT_HASH_TYPE_L4;
>> >
>> >    return PKT_HASH_TYPE_L3;
>> > }
>> >
>> > bpf_xdp_metadata_rx_csum_supported() {
>> >   return dev->features & NETIF_F_RXCSUM;
>> > }
>> >
>> > bpf_xdp_metadata_rx_csum_level() {
>> >         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>> >                                        MLX4_CQE_STATUS_UDP)) &&
>> >             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>> >             cqe->checksum == cpu_to_be16(0xffff))
>> >             return CHECKSUM_UNNECESSARY;
>> >
>> >         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
>> >               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
>> >               !short_frame(len))
>> >             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
>> >
>> >         return CHECKSUM_NONE;
>> > }
>> >
>> > /* this one could be called by the metadata_to_skb code */
>> > bpf_xdp_metadata_rx_csum_full() {
>> >   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
>> > }
>> >
>> > /* this one would be for people like John who want to re-implement
>> >  * check_csum() themselves */
>> > bpf_xdp_metadata_rx_csum_raw() {
>> >   return cqe->checksum;
>> > }
>>
>> Are you proposing a bunch of per-driver kfuncs that bpf prog will call.
>> If so that works, but bpf prog needs to pass dev and cqe pointers
>> into these kfuncs, so they need to be exposed to the prog somehow.
>> Probably through xdp_md ?

No, I didn't mean we should call per-driver kfuncs; the examples above
were meant to show what the mlx4 driver would unroll those kfuncs to.
Sorry that wasn't clear.

> So far I'm doing:
>
> struct mlx4_xdp_buff {
>   struct xdp_buff xdp;
>   struct mlx4_cqe *cqe;
>   struct mlx4_en_dev *mdev;
> }
>
> And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can
> find cqe/mdev via container_of.
>
> If we really need these to be exposed to the program, can we use
> Yonghong's approach from [0]?

I don't think we should expose them to the BPF program; I like your
approach of stuffing them next to the CTX pointer and de-referencing
that. This makes it up to the driver which extra objects it needs, and
the caller doesn't have to know/care.

I'm not vehemently opposed to *also* having the rx-desc pointer directly
accessible (in which case Yonghong's kfunc approach is probably fine).
However, as mentioned in my previous email, I doubt how useful that
descriptor itself will be...

>> This way we can have both: bpf prog reading cqe fields directly
>> and using kfuncs to access things.
>> Inlining of kfuncs should be done generically.
>> It's not a driver job to convert native asm into bpf asm.
>
> Ack. I can replace the unrolling with something that just resolves
> "generic" kfuncs to the per-driver implementation maybe? That would at
> least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..

As stated above, I think we should keep the unrolling. If we end up with
an actual CALL instruction for every piece of metadata that's going to
suck performance-wise; unrolling is how we keep this fast enough! :)

-Toke
Alexei Starovoitov Nov. 18, 2022, 12:02 a.m. UTC | #26
On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> >
> > Ack. I can replace the unrolling with something that just resolves
> > "generic" kfuncs to the per-driver implementation maybe? That would at
> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>
> As stated above, I think we should keep the unrolling. If we end up with
> an actual CALL instruction for every piece of metadata that's going to
> suck performance-wise; unrolling is how we keep this fast enough! :)

Let's start with pure kfuncs, without requiring drivers
to provide corresponding bpf asm.
If pure kfuncs do indeed turn out to be perf limiting,
then we'll decide how to optimize them.
Toke Høiland-Jørgensen Nov. 18, 2022, 12:29 a.m. UTC | #27
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> >
>> > Ack. I can replace the unrolling with something that just resolves
>> > "generic" kfuncs to the per-driver implementation maybe? That would at
>> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>>
>> As stated above, I think we should keep the unrolling. If we end up with
>> an actual CALL instruction for every piece of metadata that's going to
>> suck performance-wise; unrolling is how we keep this fast enough! :)
>
> Let's start with pure kfuncs, without requiring drivers
> to provide corresponding bpf asm.
> If pure kfuncs do indeed turn out to be perf limiting,
> then we'll decide how to optimize them.

I'm ~90% sure we'll need that optimisation, but OK, we can start from a
baseline of having them be function calls...

-Toke

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2a4592780141..c626580a2294 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -25,6 +25,7 @@ 
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/net_tstamp.h>
 
 #define DRV_NAME	"veth"
@@ -1659,6 +1660,18 @@  static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+			      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		/* return ktime_get_mono_fast_ns(); */
+		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1678,6 +1691,7 @@  static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_unroll_kfunc       = veth_unroll_kfunc,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \