Message ID | 20221115030210.3159213-6-sdf@google.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | xdp: hints via kfuncs |
Stanislav Fomichev <sdf@google.com> writes:

> The goal is to enable end-to-end testing of the metadata
> for AF_XDP. Current rx_timestamp kfunc returns current
> time which should be enough to exercise this new functionality.
>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/veth.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 2a4592780141..c626580a2294 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -25,6 +25,7 @@
>  #include <linux/filter.h>
>  #include <linux/ptr_ring.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_patch.h>
>  #include <linux/net_tstamp.h>
>
>  #define DRV_NAME "veth"
> @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>
> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +			      struct bpf_patch *patch)
> +{
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +		/* return true; */
> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> +		/* return ktime_get_mono_fast_ns(); */
> +		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> +	}
> +}

So these look reasonable enough, but would be good to see some examples
of kfunc implementations that don't just BPF_CALL to a kernel function
(with those helper wrappers we were discussing before).

-Toke
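For reference, an unroll that does not just BPF_CALL into the kernel would emit the descriptor read itself. A minimal sketch of an extra branch in the quoted veth_unroll_kfunc(), reusing the bpf_patch API from the patch above; the XDP_METADATA_KFUNC_RX_HASH id, the struct veth_xdp_buff container and its hash field are made up for illustration only:

	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH)) {
		/* return ((struct veth_xdp_buff *)ctx)->hash;  (hypothetical field)
		 *
		 * The ctx argument arrives in R1 at the patched call site
		 * (normal BPF calling convention), so a single load leaves
		 * the value in R0.
		 */
		bpf_patch_append(patch,
				 BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
					     offsetof(struct veth_xdp_buff, hash)));
	}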
On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Stanislav Fomichev <sdf@google.com> writes: > > > The goal is to enable end-to-end testing of the metadata > > for AF_XDP. Current rx_timestamp kfunc returns current > > time which should be enough to exercise this new functionality. > > > > Cc: John Fastabend <john.fastabend@gmail.com> > > Cc: David Ahern <dsahern@gmail.com> > > Cc: Martin KaFai Lau <martin.lau@linux.dev> > > Cc: Jakub Kicinski <kuba@kernel.org> > > Cc: Willem de Bruijn <willemb@google.com> > > Cc: Jesper Dangaard Brouer <brouer@redhat.com> > > Cc: Anatoly Burakov <anatoly.burakov@intel.com> > > Cc: Alexander Lobakin <alexandr.lobakin@intel.com> > > Cc: Magnus Karlsson <magnus.karlsson@gmail.com> > > Cc: Maryam Tahhan <mtahhan@redhat.com> > > Cc: xdp-hints@xdp-project.net > > Cc: netdev@vger.kernel.org > > Signed-off-by: Stanislav Fomichev <sdf@google.com> > > --- > > drivers/net/veth.c | 14 ++++++++++++++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/drivers/net/veth.c b/drivers/net/veth.c > > index 2a4592780141..c626580a2294 100644 > > --- a/drivers/net/veth.c > > +++ b/drivers/net/veth.c > > @@ -25,6 +25,7 @@ > > #include <linux/filter.h> > > #include <linux/ptr_ring.h> > > #include <linux/bpf_trace.h> > > +#include <linux/bpf_patch.h> > > #include <linux/net_tstamp.h> > > > > #define DRV_NAME "veth" > > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) > > } > > } > > > > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > + struct bpf_patch *patch) > > +{ > > + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > + /* return true; */ > > + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > + /* return ktime_get_mono_fast_ns(); */ > > + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > + } > > +} > > So these look reasonable enough, but would be good to see some examples > of kfunc implementations that don't just BPF_CALL to a kernel function > (with those helper wrappers we were discussing before). Let's maybe add them if/when needed as we add more metadata support? xdp_metadata_export_to_skb has an example, and rfc 1/2 have more examples, so it shouldn't be a problem to resurrect them back at some point?
Stanislav Fomichev <sdf@google.com> writes: > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Stanislav Fomichev <sdf@google.com> writes: >> >> > The goal is to enable end-to-end testing of the metadata >> > for AF_XDP. Current rx_timestamp kfunc returns current >> > time which should be enough to exercise this new functionality. >> > >> > Cc: John Fastabend <john.fastabend@gmail.com> >> > Cc: David Ahern <dsahern@gmail.com> >> > Cc: Martin KaFai Lau <martin.lau@linux.dev> >> > Cc: Jakub Kicinski <kuba@kernel.org> >> > Cc: Willem de Bruijn <willemb@google.com> >> > Cc: Jesper Dangaard Brouer <brouer@redhat.com> >> > Cc: Anatoly Burakov <anatoly.burakov@intel.com> >> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com> >> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com> >> > Cc: Maryam Tahhan <mtahhan@redhat.com> >> > Cc: xdp-hints@xdp-project.net >> > Cc: netdev@vger.kernel.org >> > Signed-off-by: Stanislav Fomichev <sdf@google.com> >> > --- >> > drivers/net/veth.c | 14 ++++++++++++++ >> > 1 file changed, 14 insertions(+) >> > >> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c >> > index 2a4592780141..c626580a2294 100644 >> > --- a/drivers/net/veth.c >> > +++ b/drivers/net/veth.c >> > @@ -25,6 +25,7 @@ >> > #include <linux/filter.h> >> > #include <linux/ptr_ring.h> >> > #include <linux/bpf_trace.h> >> > +#include <linux/bpf_patch.h> >> > #include <linux/net_tstamp.h> >> > >> > #define DRV_NAME "veth" >> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) >> > } >> > } >> > >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, >> > + struct bpf_patch *patch) >> > +{ >> > + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { >> > + /* return true; */ >> > + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); >> > + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { >> > + /* return ktime_get_mono_fast_ns(); */ >> > + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); >> > + } >> > +} >> >> So these look reasonable enough, but would be good to see some examples >> of kfunc implementations that don't just BPF_CALL to a kernel function >> (with those helper wrappers we were discussing before). > > Let's maybe add them if/when needed as we add more metadata support? > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > examples, so it shouldn't be a problem to resurrect them back at some > point? Well, the reason I asked for them is that I think having to maintain the BPF code generation in the drivers is probably the biggest drawback of the kfunc approach, so it would be good to be relatively sure that we can manage that complexity (via helpers) before we commit to this :) -Toke
On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Stanislav Fomichev <sdf@google.com> writes: > > > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Stanislav Fomichev <sdf@google.com> writes: > >> > >> > The goal is to enable end-to-end testing of the metadata > >> > for AF_XDP. Current rx_timestamp kfunc returns current > >> > time which should be enough to exercise this new functionality. > >> > > >> > Cc: John Fastabend <john.fastabend@gmail.com> > >> > Cc: David Ahern <dsahern@gmail.com> > >> > Cc: Martin KaFai Lau <martin.lau@linux.dev> > >> > Cc: Jakub Kicinski <kuba@kernel.org> > >> > Cc: Willem de Bruijn <willemb@google.com> > >> > Cc: Jesper Dangaard Brouer <brouer@redhat.com> > >> > Cc: Anatoly Burakov <anatoly.burakov@intel.com> > >> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com> > >> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com> > >> > Cc: Maryam Tahhan <mtahhan@redhat.com> > >> > Cc: xdp-hints@xdp-project.net > >> > Cc: netdev@vger.kernel.org > >> > Signed-off-by: Stanislav Fomichev <sdf@google.com> > >> > --- > >> > drivers/net/veth.c | 14 ++++++++++++++ > >> > 1 file changed, 14 insertions(+) > >> > > >> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c > >> > index 2a4592780141..c626580a2294 100644 > >> > --- a/drivers/net/veth.c > >> > +++ b/drivers/net/veth.c > >> > @@ -25,6 +25,7 @@ > >> > #include <linux/filter.h> > >> > #include <linux/ptr_ring.h> > >> > #include <linux/bpf_trace.h> > >> > +#include <linux/bpf_patch.h> > >> > #include <linux/net_tstamp.h> > >> > > >> > #define DRV_NAME "veth" > >> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) > >> > } > >> > } > >> > > >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > >> > + struct bpf_patch *patch) > >> > +{ > >> > + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > >> > + /* return true; */ > >> > + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > >> > + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > >> > + /* return ktime_get_mono_fast_ns(); */ > >> > + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > >> > + } > >> > +} > >> > >> So these look reasonable enough, but would be good to see some examples > >> of kfunc implementations that don't just BPF_CALL to a kernel function > >> (with those helper wrappers we were discussing before). > > > > Let's maybe add them if/when needed as we add more metadata support? > > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > examples, so it shouldn't be a problem to resurrect them back at some > > point? > > Well, the reason I asked for them is that I think having to maintain the > BPF code generation in the drivers is probably the biggest drawback of > the kfunc approach, so it would be good to be relatively sure that we > can manage that complexity (via helpers) before we commit to this :) Right, and I've added a bunch of examples in v2 rfc so we can judge whether that complexity is manageable or not :-) Do you want me to add those wrappers you've back without any real users? Because I had to remove my veth tstamp accessors due to John/Jesper objections; I can maybe bring some of this back gated by some static_branch to avoid the fastpath cost?
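The static_branch gating mentioned here would look roughly like the sketch below; the key name, the veth_xdp_buff container and its rx_timestamp field are hypothetical and not part of this series:

	DEFINE_STATIC_KEY_FALSE(veth_xdp_rx_tstamp_key);

	/* in the veth rx fast path; the key would be flipped with
	 * static_branch_enable() only when a program that reads the
	 * timestamp is attached, so other traffic does not pay for it
	 */
	if (static_branch_unlikely(&veth_xdp_rx_tstamp_key))
		vbuf->rx_timestamp = ktime_get_mono_fast_ns();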
Stanislav Fomichev wrote: > On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > > > Stanislav Fomichev <sdf@google.com> writes: > > > > > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > >> > > >> Stanislav Fomichev <sdf@google.com> writes: > > >> > > >> > The goal is to enable end-to-end testing of the metadata > > >> > for AF_XDP. Current rx_timestamp kfunc returns current > > >> > time which should be enough to exercise this new functionality. > > >> > > > >> > Cc: John Fastabend <john.fastabend@gmail.com> > > >> > Cc: David Ahern <dsahern@gmail.com> > > >> > Cc: Martin KaFai Lau <martin.lau@linux.dev> > > >> > Cc: Jakub Kicinski <kuba@kernel.org> > > >> > Cc: Willem de Bruijn <willemb@google.com> > > >> > Cc: Jesper Dangaard Brouer <brouer@redhat.com> > > >> > Cc: Anatoly Burakov <anatoly.burakov@intel.com> > > >> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com> > > >> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com> > > >> > Cc: Maryam Tahhan <mtahhan@redhat.com> > > >> > Cc: xdp-hints@xdp-project.net > > >> > Cc: netdev@vger.kernel.org > > >> > Signed-off-by: Stanislav Fomichev <sdf@google.com> > > >> > --- > > >> > drivers/net/veth.c | 14 ++++++++++++++ > > >> > 1 file changed, 14 insertions(+) > > >> > > > >> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c > > >> > index 2a4592780141..c626580a2294 100644 > > >> > --- a/drivers/net/veth.c > > >> > +++ b/drivers/net/veth.c > > >> > @@ -25,6 +25,7 @@ > > >> > #include <linux/filter.h> > > >> > #include <linux/ptr_ring.h> > > >> > #include <linux/bpf_trace.h> > > >> > +#include <linux/bpf_patch.h> > > >> > #include <linux/net_tstamp.h> > > >> > > > >> > #define DRV_NAME "veth" > > >> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) > > >> > } > > >> > } > > >> > > > >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > >> > + struct bpf_patch *patch) > > >> > +{ > > >> > + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > >> > + /* return true; */ > > >> > + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > >> > + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > >> > + /* return ktime_get_mono_fast_ns(); */ > > >> > + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > >> > + } > > >> > +} > > >> > > >> So these look reasonable enough, but would be good to see some examples > > >> of kfunc implementations that don't just BPF_CALL to a kernel function > > >> (with those helper wrappers we were discussing before). > > > > > > Let's maybe add them if/when needed as we add more metadata support? > > > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > examples, so it shouldn't be a problem to resurrect them back at some > > > point? > > > > Well, the reason I asked for them is that I think having to maintain the > > BPF code generation in the drivers is probably the biggest drawback of > > the kfunc approach, so it would be good to be relatively sure that we > > can manage that complexity (via helpers) before we commit to this :) > > Right, and I've added a bunch of examples in v2 rfc so we can judge > whether that complexity is manageable or not :-) > Do you want me to add those wrappers you've back without any real users? 
> Because I had to remove my veth tstamp accessors due to John/Jesper
> objections; I can maybe bring some of this back gated by some
> static_branch to avoid the fastpath cost?

I missed the context a bit what did you mean "would be good to see some
examples of kfunc implementations that don't just BPF_CALL to a kernel
function"? In this case do you mean BPF code directly without the call?

Early on I thought we should just expose the rx_descriptor which would
be roughly the same right? (difference being code embedded in driver vs
a lib) Trouble I ran into is driver code using seqlock_t and mutexs
which wasn't as straight forward as the simpler just read it from
the descriptor. For example in mlx getting the ts would be easy from
BPF with the mlx4_cqe struct exposed

u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
{
	u64 hi, lo;
	struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;

	lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
	hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;

	return hi | lo;
}

but converting that to nsec is a bit annoying,

void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
			    struct skb_shared_hwtstamps *hwts,
			    u64 timestamp)
{
	unsigned int seq;
	u64 nsec;

	do {
		seq = read_seqbegin(&mdev->clock_lock);
		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
	} while (read_seqretry(&mdev->clock_lock, seq));

	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
	hwts->hwtstamp = ns_to_ktime(nsec);
}

I think the nsec is what you really want.

With all the drivers doing slightly different ops we would have
to create read_seqbegin, read_seqretry, mutex_lock, ... to get
at least the mlx and ice drivers it looks like we would need some
more BPF primitives/helpers. Looks like some more work is needed
on ice driver though to get rx tstamps on all packets.

Anyways this convinced me real devices will probably use BPF_CALL
and not BPF insns directly.
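A sketch of what that BPF_CALL-style unroll could look like for such a driver; mlx4_en_xdp_rx_timestamp() is hypothetical and would wrap the seqlock-protected cycles-to-nsec conversion shown above:

static void mlx4_en_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
				 struct bpf_patch *patch)
{
	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
		/* R1 still holds the ctx/xdp_buff pointer; the hypothetical
		 * helper would dig the cqe out of it, read timestamp_{hi,lo}
		 * and do the clock_lock-protected timecounter_cyc2time()
		 * conversion, returning nanoseconds in R0.
		 */
		bpf_patch_append(patch, BPF_EMIT_CALL(mlx4_en_xdp_rx_timestamp));
	}
}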
On 11/15/22 10:38 PM, John Fastabend wrote: >>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, >>>>>> + struct bpf_patch *patch) >>>>>> +{ >>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { >>>>>> + /* return true; */ >>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); >>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { >>>>>> + /* return ktime_get_mono_fast_ns(); */ >>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); >>>>>> + } >>>>>> +} >>>>> >>>>> So these look reasonable enough, but would be good to see some examples >>>>> of kfunc implementations that don't just BPF_CALL to a kernel function >>>>> (with those helper wrappers we were discussing before). >>>> >>>> Let's maybe add them if/when needed as we add more metadata support? >>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more >>>> examples, so it shouldn't be a problem to resurrect them back at some >>>> point? >>> >>> Well, the reason I asked for them is that I think having to maintain the >>> BPF code generation in the drivers is probably the biggest drawback of >>> the kfunc approach, so it would be good to be relatively sure that we >>> can manage that complexity (via helpers) before we commit to this :) >> >> Right, and I've added a bunch of examples in v2 rfc so we can judge >> whether that complexity is manageable or not :-) >> Do you want me to add those wrappers you've back without any real users? >> Because I had to remove my veth tstamp accessors due to John/Jesper >> objections; I can maybe bring some of this back gated by some >> static_branch to avoid the fastpath cost? > > I missed the context a bit what did you mean "would be good to see some > examples of kfunc implementations that don't just BPF_CALL to a kernel > function"? In this case do you mean BPF code directly without the call? > > Early on I thought we should just expose the rx_descriptor which would > be roughly the same right? (difference being code embedded in driver vs > a lib) Trouble I ran into is driver code using seqlock_t and mutexs > which wasn't as straight forward as the simpler just read it from > the descriptor. For example in mlx getting the ts would be easy from > BPF with the mlx4_cqe struct exposed > > u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > { > u64 hi, lo; > struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > return hi | lo; > } > > but converting that to nsec is a bit annoying, > > void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > struct skb_shared_hwtstamps *hwts, > u64 timestamp) > { > unsigned int seq; > u64 nsec; > > do { > seq = read_seqbegin(&mdev->clock_lock); > nsec = timecounter_cyc2time(&mdev->clock, timestamp); > } while (read_seqretry(&mdev->clock_lock, seq)); > > memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > hwts->hwtstamp = ns_to_ktime(nsec); > } > > I think the nsec is what you really want. > > With all the drivers doing slightly different ops we would have > to create read_seqbegin, read_seqretry, mutex_lock, ... to get > at least the mlx and ice drivers it looks like we would need some > more BPF primitives/helpers. Looks like some more work is needed > on ice driver though to get rx tstamps on all packets. > > Anyways this convinced me real devices will probably use BPF_CALL > and not BPF insns directly. 
Some of the mlx5 path looks like this:

#define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))

static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
					      u64 timestamp)
{
	u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);

	return ns_to_ktime(time);
}

If some hints are harder to get, then just doing a kfunc call is better.
csum may have a better chance to inline?

Regardless, BPF in-lining is a well solved problem and used in many
bpf helpers already, so there are many examples in the kernel. I don't
think it is necessary to block this series because of missing some
helper wrappers for inlining. The driver can always start with the
simpler kfunc call first and optimize later if some hints from the
drivers allow it.
Martin KaFai Lau <martin.lau@linux.dev> writes: > On 11/15/22 10:38 PM, John Fastabend wrote: >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, >>>>>>> + struct bpf_patch *patch) >>>>>>> +{ >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { >>>>>>> + /* return true; */ >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { >>>>>>> + /* return ktime_get_mono_fast_ns(); */ >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); >>>>>>> + } >>>>>>> +} >>>>>> >>>>>> So these look reasonable enough, but would be good to see some examples >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function >>>>>> (with those helper wrappers we were discussing before). >>>>> >>>>> Let's maybe add them if/when needed as we add more metadata support? >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more >>>>> examples, so it shouldn't be a problem to resurrect them back at some >>>>> point? >>>> >>>> Well, the reason I asked for them is that I think having to maintain the >>>> BPF code generation in the drivers is probably the biggest drawback of >>>> the kfunc approach, so it would be good to be relatively sure that we >>>> can manage that complexity (via helpers) before we commit to this :) >>> >>> Right, and I've added a bunch of examples in v2 rfc so we can judge >>> whether that complexity is manageable or not :-) >>> Do you want me to add those wrappers you've back without any real users? >>> Because I had to remove my veth tstamp accessors due to John/Jesper >>> objections; I can maybe bring some of this back gated by some >>> static_branch to avoid the fastpath cost? >> >> I missed the context a bit what did you mean "would be good to see some >> examples of kfunc implementations that don't just BPF_CALL to a kernel >> function"? In this case do you mean BPF code directly without the call? >> >> Early on I thought we should just expose the rx_descriptor which would >> be roughly the same right? (difference being code embedded in driver vs >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs >> which wasn't as straight forward as the simpler just read it from >> the descriptor. For example in mlx getting the ts would be easy from >> BPF with the mlx4_cqe struct exposed >> >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) >> { >> u64 hi, lo; >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; >> >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; >> >> return hi | lo; >> } >> >> but converting that to nsec is a bit annoying, >> >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, >> struct skb_shared_hwtstamps *hwts, >> u64 timestamp) >> { >> unsigned int seq; >> u64 nsec; >> >> do { >> seq = read_seqbegin(&mdev->clock_lock); >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); >> } while (read_seqretry(&mdev->clock_lock, seq)); >> >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); >> hwts->hwtstamp = ns_to_ktime(nsec); >> } >> >> I think the nsec is what you really want. >> >> With all the drivers doing slightly different ops we would have >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get >> at least the mlx and ice drivers it looks like we would need some >> more BPF primitives/helpers. 
Looks like some more work is needed >> on ice driver though to get rx tstamps on all packets. >> >> Anyways this convinced me real devices will probably use BPF_CALL >> and not BPF insns directly. > > Some of the mlx5 path looks like this: > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > u64 timestamp) > { > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > return ns_to_ktime(time); > } > > If some hints are harder to get, then just doing a kfunc call is better. Sure, but if we end up having a full function call for every field in the metadata, that will end up having a significant performance impact on the XDP data path (thinking mostly about the skb metadata case here, which will collect several bits of metadata). > csum may have a better chance to inline? Yup, I agree. Including that also makes it possible to benchmark this series against Jesper's; which I think we should definitely be doing before merging this. > Regardless, BPF in-lining is a well solved problem and used in many > bpf helpers already, so there are many examples in the kernel. I don't > think it is necessary to block this series because of missing some > helper wrappers for inlining. The driver can always start with the > simpler kfunc call first and optimize later if some hints from the > drivers allow it. Well, "solved" in the sense of "there are a few handfuls of core BPF people who know how to do it". My concern is that we'll end up with either the BPF devs having to maintain all these bits of BPF byte code in all the drivers; or drivers just punting to regular function calls because the inlining is too complicated, with sub-par performance as per the above. I don't think we should just hand-wave this away as "solved", but rather treat this as an integral part of the initial series. -Toke
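To make the "helper wrappers" idea concrete, one possible shape (a hypothetical API, not part of this series) is to hide the raw insn emission so a driver never open-codes BPF_MOV64_IMM/BPF_LDX_MEM/BPF_EMIT_CALL itself:

/* return constant */
static inline void bpf_patch_ret_imm(struct bpf_patch *patch, s32 imm)
{
	bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, imm));
}

/* return the 32-bit field at @off inside the ctx argument (R1) */
static inline void bpf_patch_ret_ctx_u32(struct bpf_patch *patch, u32 off)
{
	bpf_patch_append(patch, BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, off));
}

/* fall back to a plain function call */
static inline void bpf_patch_ret_call(struct bpf_patch *patch, void *func)
{
	bpf_patch_append(patch, BPF_EMIT_CALL(func));
}

With wrappers like these, the veth example above would reduce to bpf_patch_ret_imm(patch, 1) and bpf_patch_ret_call(patch, ktime_get_mono_fast_ns).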
On 11/16/22 2:08 AM, Toke Høiland-Jørgensen wrote: > Martin KaFai Lau <martin.lau@linux.dev> writes: > >> On 11/15/22 10:38 PM, John Fastabend wrote: >>>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, >>>>>>>> + struct bpf_patch *patch) >>>>>>>> +{ >>>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { >>>>>>>> + /* return true; */ >>>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); >>>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { >>>>>>>> + /* return ktime_get_mono_fast_ns(); */ >>>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); >>>>>>>> + } >>>>>>>> +} >>>>>>> >>>>>>> So these look reasonable enough, but would be good to see some examples >>>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function >>>>>>> (with those helper wrappers we were discussing before). >>>>>> >>>>>> Let's maybe add them if/when needed as we add more metadata support? >>>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more >>>>>> examples, so it shouldn't be a problem to resurrect them back at some >>>>>> point? >>>>> >>>>> Well, the reason I asked for them is that I think having to maintain the >>>>> BPF code generation in the drivers is probably the biggest drawback of >>>>> the kfunc approach, so it would be good to be relatively sure that we >>>>> can manage that complexity (via helpers) before we commit to this :) >>>> >>>> Right, and I've added a bunch of examples in v2 rfc so we can judge >>>> whether that complexity is manageable or not :-) >>>> Do you want me to add those wrappers you've back without any real users? >>>> Because I had to remove my veth tstamp accessors due to John/Jesper >>>> objections; I can maybe bring some of this back gated by some >>>> static_branch to avoid the fastpath cost? >>> >>> I missed the context a bit what did you mean "would be good to see some >>> examples of kfunc implementations that don't just BPF_CALL to a kernel >>> function"? In this case do you mean BPF code directly without the call? >>> >>> Early on I thought we should just expose the rx_descriptor which would >>> be roughly the same right? (difference being code embedded in driver vs >>> a lib) Trouble I ran into is driver code using seqlock_t and mutexs >>> which wasn't as straight forward as the simpler just read it from >>> the descriptor. For example in mlx getting the ts would be easy from >>> BPF with the mlx4_cqe struct exposed >>> >>> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) >>> { >>> u64 hi, lo; >>> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; >>> >>> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); >>> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; >>> >>> return hi | lo; >>> } >>> >>> but converting that to nsec is a bit annoying, >>> >>> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, >>> struct skb_shared_hwtstamps *hwts, >>> u64 timestamp) >>> { >>> unsigned int seq; >>> u64 nsec; >>> >>> do { >>> seq = read_seqbegin(&mdev->clock_lock); >>> nsec = timecounter_cyc2time(&mdev->clock, timestamp); >>> } while (read_seqretry(&mdev->clock_lock, seq)); >>> >>> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); >>> hwts->hwtstamp = ns_to_ktime(nsec); >>> } >>> >>> I think the nsec is what you really want. >>> >>> With all the drivers doing slightly different ops we would have >>> to create read_seqbegin, read_seqretry, mutex_lock, ... 
to get >>> at least the mlx and ice drivers it looks like we would need some >>> more BPF primitives/helpers. Looks like some more work is needed >>> on ice driver though to get rx tstamps on all packets. >>> >>> Anyways this convinced me real devices will probably use BPF_CALL >>> and not BPF insns directly. >> >> Some of the mlx5 path looks like this: >> >> #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) >> >> static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, >> u64 timestamp) >> { >> u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); >> >> return ns_to_ktime(time); >> } >> >> If some hints are harder to get, then just doing a kfunc call is better. > > Sure, but if we end up having a full function call for every field in > the metadata, that will end up having a significant performance impact > on the XDP data path (thinking mostly about the skb metadata case here, > which will collect several bits of metadata). > >> csum may have a better chance to inline? It will be useful if the skb_metadata can get at least one more field like csum or vlan. > > Yup, I agree. Including that also makes it possible to benchmark this > series against Jesper's; which I think we should definitely be doing > before merging this. If the hint needs a lock to obtain it, it is not cheap to begin with regardless of kfunc or not. Also, there is bpf_xdp_metadata_rx_timestamp_supported() before doing the bpf_xdp_metadata_rx_timestamp(). This set gives the xdp prog a flexible way to avoid getting all hints (some of them require a lock) if all the xdp prog needs is a csum. imo, giving this flexibility to the xdp prog is the important thing for this set in terms of performance. > >> Regardless, BPF in-lining is a well solved problem and used in many >> bpf helpers already, so there are many examples in the kernel. I don't >> think it is necessary to block this series because of missing some >> helper wrappers for inlining. The driver can always start with the >> simpler kfunc call first and optimize later if some hints from the >> drivers allow it. > > Well, "solved" in the sense of "there are a few handfuls of core BPF > people who know how to do it". My concern is that we'll end up with > either the BPF devs having to maintain all these bits of BPF byte code > in all the drivers; or drivers just punting to regular function calls > because the inlining is too complicated, with sub-par performance as per > the above. I don't think we should just hand-wave this away as "solved", > but rather treat this as an integral part of the initial series. In terms of complexity/maintainability, I don't think the driver needs to inline like hundreds of bpf insn for a hint which is already a good signal that it should just call kfunc. For a simple hint, xdp_metadata_export_to_skb() is a good example to start with and I am not sure how much a helper wrapper can do to simplify things further since each driver inline code is going to be different. The community's review will definitely be useful for the first few drivers when the driver changes is posted to the list, and the latter drivers will have easier time to follow like the xdp was initially introduced to the few drivers first. When there are things to refactor after enough drivers supporting it, that will be a better time to revisit.
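The program-side flexibility described above would look roughly like the following; the kfunc names are taken from this discussion, and the exact signatures are assumed:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
	__u64 ts = 0;

	/* Only pay for the (possibly lock-protected) timestamp read when the
	 * device supports it; a program that only wants e.g. csum simply
	 * never calls it.
	 */
	if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
		ts = bpf_xdp_metadata_rx_timestamp(ctx);

	bpf_printk("rx timestamp: %llu", ts);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";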
Toke Høiland-Jørgensen wrote: > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > On 11/15/22 10:38 PM, John Fastabend wrote: > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > >>>>>>> + struct bpf_patch *patch) > >>>>>>> +{ > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > >>>>>>> + /* return true; */ > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > >>>>>>> + } > >>>>>>> +} > >>>>>> > >>>>>> So these look reasonable enough, but would be good to see some examples > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > >>>>>> (with those helper wrappers we were discussing before). > >>>>> > >>>>> Let's maybe add them if/when needed as we add more metadata support? > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > >>>>> point? > >>>> > >>>> Well, the reason I asked for them is that I think having to maintain the > >>>> BPF code generation in the drivers is probably the biggest drawback of > >>>> the kfunc approach, so it would be good to be relatively sure that we > >>>> can manage that complexity (via helpers) before we commit to this :) > >>> > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > >>> whether that complexity is manageable or not :-) > >>> Do you want me to add those wrappers you've back without any real users? > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > >>> objections; I can maybe bring some of this back gated by some > >>> static_branch to avoid the fastpath cost? > >> > >> I missed the context a bit what did you mean "would be good to see some > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > >> function"? In this case do you mean BPF code directly without the call? > >> > >> Early on I thought we should just expose the rx_descriptor which would > >> be roughly the same right? (difference being code embedded in driver vs > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > >> which wasn't as straight forward as the simpler just read it from > >> the descriptor. For example in mlx getting the ts would be easy from > >> BPF with the mlx4_cqe struct exposed > >> > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > >> { > >> u64 hi, lo; > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > >> > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > >> > >> return hi | lo; > >> } > >> > >> but converting that to nsec is a bit annoying, > >> > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > >> struct skb_shared_hwtstamps *hwts, > >> u64 timestamp) > >> { > >> unsigned int seq; > >> u64 nsec; > >> > >> do { > >> seq = read_seqbegin(&mdev->clock_lock); > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > >> } while (read_seqretry(&mdev->clock_lock, seq)); > >> > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > >> hwts->hwtstamp = ns_to_ktime(nsec); > >> } > >> > >> I think the nsec is what you really want. > >> > >> With all the drivers doing slightly different ops we would have > >> to create read_seqbegin, read_seqretry, mutex_lock, ... 
to get > >> at least the mlx and ice drivers it looks like we would need some > >> more BPF primitives/helpers. Looks like some more work is needed > >> on ice driver though to get rx tstamps on all packets. > >> > >> Anyways this convinced me real devices will probably use BPF_CALL > >> and not BPF insns directly. > > > > Some of the mlx5 path looks like this: > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > u64 timestamp) > > { > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > return ns_to_ktime(time); > > } > > > > If some hints are harder to get, then just doing a kfunc call is better. > > Sure, but if we end up having a full function call for every field in > the metadata, that will end up having a significant performance impact > on the XDP data path (thinking mostly about the skb metadata case here, > which will collect several bits of metadata). > > > csum may have a better chance to inline? > > Yup, I agree. Including that also makes it possible to benchmark this > series against Jesper's; which I think we should definitely be doing > before merging this. Good point I got sort of singularly focused on timestamp because I have a use case for it now. Also hash is often sitting in the rx descriptor. > > > Regardless, BPF in-lining is a well solved problem and used in many > > bpf helpers already, so there are many examples in the kernel. I don't > > think it is necessary to block this series because of missing some > > helper wrappers for inlining. The driver can always start with the > > simpler kfunc call first and optimize later if some hints from the > > drivers allow it. > > Well, "solved" in the sense of "there are a few handfuls of core BPF > people who know how to do it". My concern is that we'll end up with > either the BPF devs having to maintain all these bits of BPF byte code > in all the drivers; or drivers just punting to regular function calls > because the inlining is too complicated, with sub-par performance as per > the above. I don't think we should just hand-wave this away as "solved", > but rather treat this as an integral part of the initial series. This was my motivation for pushing the rx_descriptor into the xdp_buff. At this point if I'm going to have a kfunc call into the driver and have the driver rewrite the code into some BPF instructions I would just assume maintain this as a library code where I can hook it into my BPF program directly from user space. Maybe a few drivers will support all the things I want to read, but we run on lots of hardware (mlx, intel, eks, azure, gke, etc) and its been a lot of work to just get the basic feature parity. I also don't want to run around writing driver code for each vendor if I can avoid it. Having raw access to the rx descriptor gets me the escape hatch where I can just do it myself. And the last piece of info from my point of view (Tetragon, Cilium) I can run whatever libs I want and freely upgrade libbpf and cilium/ebpf but have a lot less ability to get users to upgrade kernels outside the LTS they picked. Meaning I can add new things much easier if its lifted into BPF code placed by user space. I appreciate that it means I import the problem of hardware detection and BTF CO-RE on networking codes, but we've already solved these problems for other reasons. 
For example, just configuring the timestamp is a bit of an exercise in "does my hardware support timestamping, and does it support timestamping the packets I care about?" (e.g. all pkts, just ptp pkts, etc.). I don't think they are mutually exclusive with this series though, because I can't see how to write this timestamping logic directly in BPF. But for rxhash and csum it seems doable. My preference is to have both the kfuncs and expose the descriptor directly.

.John
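For comparison, the "expose the descriptor + CO-RE" escape hatch would look something like the sketch below. Everything here is hypothetical: this series has no bpf_xdp_get_rx_descriptor() kfunc, the mlx4 types are assumed to be visible through vmlinux.h, and it presumes the verifier would hand back a trusted pointer that the program may dereference directly (with CO-RE keeping the field offsets in sync with the running kernel):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

extern struct mlx4_cqe *bpf_xdp_get_rx_descriptor(struct xdp_md *ctx) __ksym; /* hypothetical */

SEC("xdp")
int raw_desc_ts(struct xdp_md *ctx)
{
	struct mlx4_cqe *cqe = bpf_xdp_get_rx_descriptor(ctx);
	struct mlx4_ts_cqe *ts_cqe;
	__u64 hi, lo;

	if (!cqe)
		return XDP_PASS;

	/* Same math as mlx4_en_get_cqe_ts() quoted earlier. The result is
	 * still in device clock cycles, not nanoseconds -- the cycles-to-ns
	 * conversion is exactly the clock_lock-protected part that is hard
	 * to pull into BPF.
	 */
	ts_cqe = (struct mlx4_ts_cqe *)cqe;
	lo = bpf_ntohs(ts_cqe->timestamp_lo);
	hi = ((__u64)bpf_ntohl(ts_cqe->timestamp_hi) + !lo) << 16;

	bpf_printk("raw cqe timestamp: %llu", hi | lo);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";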
On Wed, Nov 16, 2022 at 11:03 AM John Fastabend <john.fastabend@gmail.com> wrote: > > Toke Høiland-Jørgensen wrote: > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > >>>>>>> + struct bpf_patch *patch) > > >>>>>>> +{ > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > >>>>>>> + /* return true; */ > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > >>>>>>> + } > > >>>>>>> +} > > >>>>>> > > >>>>>> So these look reasonable enough, but would be good to see some examples > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > >>>>>> (with those helper wrappers we were discussing before). > > >>>>> > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > >>>>> point? > > >>>> > > >>>> Well, the reason I asked for them is that I think having to maintain the > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > >>>> can manage that complexity (via helpers) before we commit to this :) > > >>> > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > >>> whether that complexity is manageable or not :-) > > >>> Do you want me to add those wrappers you've back without any real users? > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > >>> objections; I can maybe bring some of this back gated by some > > >>> static_branch to avoid the fastpath cost? > > >> > > >> I missed the context a bit what did you mean "would be good to see some > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > >> function"? In this case do you mean BPF code directly without the call? > > >> > > >> Early on I thought we should just expose the rx_descriptor which would > > >> be roughly the same right? (difference being code embedded in driver vs > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > >> which wasn't as straight forward as the simpler just read it from > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > >> BPF with the mlx4_cqe struct exposed > > >> > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > >> { > > >> u64 hi, lo; > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > >> > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > >> > > >> return hi | lo; > > >> } > > >> > > >> but converting that to nsec is a bit annoying, > > >> > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > >> struct skb_shared_hwtstamps *hwts, > > >> u64 timestamp) > > >> { > > >> unsigned int seq; > > >> u64 nsec; > > >> > > >> do { > > >> seq = read_seqbegin(&mdev->clock_lock); > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > >> > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > >> } > > >> > > >> I think the nsec is what you really want. > > >> > > >> With all the drivers doing slightly different ops we would have > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > >> at least the mlx and ice drivers it looks like we would need some > > >> more BPF primitives/helpers. Looks like some more work is needed > > >> on ice driver though to get rx tstamps on all packets. > > >> > > >> Anyways this convinced me real devices will probably use BPF_CALL > > >> and not BPF insns directly. > > > > > > Some of the mlx5 path looks like this: > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > u64 timestamp) > > > { > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > return ns_to_ktime(time); > > > } > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > Sure, but if we end up having a full function call for every field in > > the metadata, that will end up having a significant performance impact > > on the XDP data path (thinking mostly about the skb metadata case here, > > which will collect several bits of metadata). > > > > > csum may have a better chance to inline? > > > > Yup, I agree. Including that also makes it possible to benchmark this > > series against Jesper's; which I think we should definitely be doing > > before merging this. > > Good point I got sort of singularly focused on timestamp because I have > a use case for it now. > > Also hash is often sitting in the rx descriptor. Ack, let me try to add something else (that's more inline-able) on the rx side for a v2. > > > > > Regardless, BPF in-lining is a well solved problem and used in many > > > bpf helpers already, so there are many examples in the kernel. I don't > > > think it is necessary to block this series because of missing some > > > helper wrappers for inlining. The driver can always start with the > > > simpler kfunc call first and optimize later if some hints from the > > > drivers allow it. > > > > Well, "solved" in the sense of "there are a few handfuls of core BPF > > people who know how to do it". My concern is that we'll end up with > > either the BPF devs having to maintain all these bits of BPF byte code > > in all the drivers; or drivers just punting to regular function calls > > because the inlining is too complicated, with sub-par performance as per > > the above. 
I don't think we should just hand-wave this away as "solved", > > but rather treat this as an integral part of the initial series. > > This was my motivation for pushing the rx_descriptor into the xdp_buff. > At this point if I'm going to have a kfunc call into the driver and > have the driver rewrite the code into some BPF instructions I would > just assume maintain this as a library code where I can hook it > into my BPF program directly from user space. Maybe a few drivers > will support all the things I want to read, but we run on lots of > hardware (mlx, intel, eks, azure, gke, etc) and its been a lot of work > to just get the basic feature parity. I also don't want to run around > writing driver code for each vendor if I can avoid it. Having raw > access to the rx descriptor gets me the escape hatch where I can > just do it myself. And the last piece of info from my point of view > (Tetragon, Cilium) I can run whatever libs I want and freely upgrade > libbpf and cilium/ebpf but have a lot less ability to get users > to upgrade kernels outside the LTS they picked. Meaning I can > add new things much easier if its lifted into BPF code placed > by user space. > > I appreciate that it means I import the problem of hardware detection > and BTF CO-RE on networking codes, but we've already solved these > problems for other reasons. For example just configuring the timestamp > is a bit of an exercise in does my hardware support timestamping > and does it support timestamping the packets I care about, e.g. > all pkts, just ptp pkts, etc. > > I don't think they are mutual exclusive with this series though > because I can't see how to write these timestamping logic directly > in BPF. But for rxhash and csum it seems doable. My preference > is to have both the kfuncs and expose the descriptor directly. > > .John
Stanislav Fomichev wrote: > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Toke Høiland-Jørgensen wrote: > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > >>>>>>> + struct bpf_patch *patch) > > > >>>>>>> +{ > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > >>>>>>> + /* return true; */ > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > >>>>>>> + } > > > >>>>>>> +} > > > >>>>>> > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > >>>>>> (with those helper wrappers we were discussing before). > > > >>>>> > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > >>>>> point? > > > >>>> > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > >>> > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > >>> whether that complexity is manageable or not :-) > > > >>> Do you want me to add those wrappers you've back without any real users? > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > >>> objections; I can maybe bring some of this back gated by some > > > >>> static_branch to avoid the fastpath cost? > > > >> > > > >> I missed the context a bit what did you mean "would be good to see some > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > >> function"? In this case do you mean BPF code directly without the call? > > > >> > > > >> Early on I thought we should just expose the rx_descriptor which would > > > >> be roughly the same right? (difference being code embedded in driver vs > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > >> which wasn't as straight forward as the simpler just read it from > > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > > >> BPF with the mlx4_cqe struct exposed > > > >> > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > >> { > > > >> u64 hi, lo; > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > >> > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > >> > > > >> return hi | lo; > > > >> } > > > >> > > > >> but converting that to nsec is a bit annoying, > > > >> > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > >> struct skb_shared_hwtstamps *hwts, > > > >> u64 timestamp) > > > >> { > > > >> unsigned int seq; > > > >> u64 nsec; > > > >> > > > >> do { > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > >> > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > >> } > > > >> > > > >> I think the nsec is what you really want. > > > >> > > > >> With all the drivers doing slightly different ops we would have > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > >> at least the mlx and ice drivers it looks like we would need some > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > >> on ice driver though to get rx tstamps on all packets. > > > >> > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > >> and not BPF insns directly. > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > u64 timestamp) > > > > { > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > return ns_to_ktime(time); > > > > } > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > Sure, but if we end up having a full function call for every field in > > > the metadata, that will end up having a significant performance impact > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > which will collect several bits of metadata). > > > > > > > csum may have a better chance to inline? > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > series against Jesper's; which I think we should definitely be doing > > > before merging this. > > > > Good point I got sort of singularly focused on timestamp because I have > > a use case for it now. > > > > Also hash is often sitting in the rx descriptor. > > Ack, let me try to add something else (that's more inline-able) on the > rx side for a v2. If you go with in-kernel BPF kfunc approach (vs user space side) I think you also need to add CO-RE to be friendly for driver developers? Otherwise they have to keep that read in sync with the descriptors? Also need to handle versioning of descriptors where depending on specific options and firmware and chip being enabled the descriptor might be moving around. Of course can push this all to developer, but seems not so nice when we have the machinery to do this and we handle it for all other structures. With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and expect CO-RE sorts it out based on currently running btf_id of the descriptor. If you go through normal path you get this for free of course. 
.John
On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Stanislav Fomichev wrote: > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > <john.fastabend@gmail.com> wrote: > > > > > > Toke Høiland-Jørgensen wrote: > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > >>>>>>> + struct bpf_patch *patch) > > > > >>>>>>> +{ > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > >>>>>>> + /* return true; */ > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > >>>>>>> + } > > > > >>>>>>> +} > > > > >>>>>> > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > >>>>> > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > >>>>> point? > > > > >>>> > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > >>> > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > >>> whether that complexity is manageable or not :-) > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > >>> objections; I can maybe bring some of this back gated by some > > > > >>> static_branch to avoid the fastpath cost? > > > > >> > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > >> > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > >> which wasn't as straight forward as the simpler just read it from > > > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > > > >> BPF with the mlx4_cqe struct exposed > > > > >> > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > >> { > > > > >> u64 hi, lo; > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > >> > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > >> > > > > >> return hi | lo; > > > > >> } > > > > >> > > > > >> but converting that to nsec is a bit annoying, > > > > >> > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > >> struct skb_shared_hwtstamps *hwts, > > > > >> u64 timestamp) > > > > >> { > > > > >> unsigned int seq; > > > > >> u64 nsec; > > > > >> > > > > >> do { > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > >> > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > >> } > > > > >> > > > > >> I think the nsec is what you really want. > > > > >> > > > > >> With all the drivers doing slightly different ops we would have > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > >> on ice driver though to get rx tstamps on all packets. > > > > >> > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > >> and not BPF insns directly. > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > u64 timestamp) > > > > > { > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > return ns_to_ktime(time); > > > > > } > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > the metadata, that will end up having a significant performance impact > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > which will collect several bits of metadata). > > > > > > > > > csum may have a better chance to inline? > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > series against Jesper's; which I think we should definitely be doing > > > > before merging this. > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > a use case for it now. > > > > > > Also hash is often sitting in the rx descriptor. > > > > Ack, let me try to add something else (that's more inline-able) on the > > rx side for a v2. > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > you also need to add CO-RE to be friendly for driver developers? Otherwise > they have to keep that read in sync with the descriptors? Also need to > handle versioning of descriptors where depending on specific options > and firmware and chip being enabled the descriptor might be moving > around. Of course can push this all to developer, but seems not so > nice when we have the machinery to do this and we handle it for all > other structures. 
> > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > expect CO-RE sorts it out based on currently running btf_id of the > descriptor. If you go through normal path you get this for free of > course. Doesn't look like the descriptors are as nice as you're trying to paint them (with clear hash/csum fields) :-) So not sure how much CO-RE would help. At least looking at mlx4 rx_csum, the driver consults three different sets of flags to figure out the hash_type. Or am I just unlucky with mlx4?
On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Stanislav Fomichev wrote: > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > <john.fastabend@gmail.com> wrote: > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > >>>>>>> +{ > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > >>>>>>> + /* return true; */ > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > >>>>>>> + } > > > > > >>>>>>> +} > > > > > >>>>>> > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > >>>>> > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > >>>>> point? > > > > > >>>> > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > >>> > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > >>> whether that complexity is manageable or not :-) > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > >>> static_branch to avoid the fastpath cost? > > > > > >> > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > >> > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > >> > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > >> { > > > > > >> u64 hi, lo; > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > >> > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > >> > > > > > >> return hi | lo; > > > > > >> } > > > > > >> > > > > > >> but converting that to nsec is a bit annoying, > > > > > >> > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > >> u64 timestamp) > > > > > >> { > > > > > >> unsigned int seq; > > > > > >> u64 nsec; > > > > > >> > > > > > >> do { > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > >> > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > >> } > > > > > >> > > > > > >> I think the nsec is what you really want. > > > > > >> > > > > > >> With all the drivers doing slightly different ops we would have > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > >> > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > >> and not BPF insns directly. > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > u64 timestamp) > > > > > > { > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > } > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > the metadata, that will end up having a significant performance impact > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > which will collect several bits of metadata). > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > series against Jesper's; which I think we should definitely be doing > > > > > before merging this. > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > a use case for it now. > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > rx side for a v2. > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > they have to keep that read in sync with the descriptors? Also need to > > handle versioning of descriptors where depending on specific options > > and firmware and chip being enabled the descriptor might be moving > > around. 
Of course can push this all to developer, but seems not so > > nice when we have the machinery to do this and we handle it for all > > other structures. > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > expect CO-RE sorts it out based on currently running btf_id of the > > descriptor. If you go through normal path you get this for free of > > course. > > Doesn't look like the descriptors are as nice as you're trying to > paint them (with clear hash/csum fields) :-) So not sure how much > CO-RE would help. > At least looking at mlx4 rx_csum, the driver consults three different > sets of flags to figure out the hash_type. Or am I just unlucky with > mlx4? Which part are you talking about ? hw_checksum = csum_unfold((__force __sum16)cqe->checksum); is trivial enough for bpf prog to do if it has access to 'cqe' pointer which is what John is proposing (I think).
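As a concrete illustration of the point above: csum_unfold() is only a type-level cast in the kernel, so given access to the cqe pointer the BPF-side equivalent really is a single 16-bit load. A hedged sketch, assuming the program were handed a struct mlx4_cqe pointer (which nothing in this series exposes yet) and that the struct is visible through vmlinux.h/BTF:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_endian.h>

    static __always_inline __u16
    read_raw_hw_csum(const struct mlx4_cqe *cqe)
    {
            /* driver side: hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
             * csum_unfold() is just a cast, so this is a plain 2-byte read. */
            return bpf_ntohs(BPF_CORE_READ(cqe, checksum));
    }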
On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > Stanislav Fomichev wrote: > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > >>>>>>> +{ > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > >>>>>>> + /* return true; */ > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > >>>>>>> + } > > > > > > >>>>>>> +} > > > > > > >>>>>> > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > >>>>> > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > >>>>> point? > > > > > > >>>> > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > >>> > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > >>> static_branch to avoid the fastpath cost? > > > > > > >> > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > > >> > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > >> > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > >> { > > > > > > >> u64 hi, lo; > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > >> > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > >> > > > > > > >> return hi | lo; > > > > > > >> } > > > > > > >> > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > >> > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > >> u64 timestamp) > > > > > > >> { > > > > > > >> unsigned int seq; > > > > > > >> u64 nsec; > > > > > > >> > > > > > > >> do { > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > >> > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > >> } > > > > > > >> > > > > > > >> I think the nsec is what you really want. > > > > > > >> > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > >> > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > >> and not BPF insns directly. > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > u64 timestamp) > > > > > > > { > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > } > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > the metadata, that will end up having a significant performance impact > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > which will collect several bits of metadata). > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > before merging this. > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > a use case for it now. > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > rx side for a v2. > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > they have to keep that read in sync with the descriptors? 
Also need to > > > handle versioning of descriptors where depending on specific options > > > and firmware and chip being enabled the descriptor might be moving > > > around. Of course can push this all to developer, but seems not so > > > nice when we have the machinery to do this and we handle it for all > > > other structures. > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > expect CO-RE sorts it out based on currently running btf_id of the > > > descriptor. If you go through normal path you get this for free of > > > course. > > > > Doesn't look like the descriptors are as nice as you're trying to > > paint them (with clear hash/csum fields) :-) So not sure how much > > CO-RE would help. > > At least looking at mlx4 rx_csum, the driver consults three different > > sets of flags to figure out the hash_type. Or am I just unlucky with > > mlx4? > > Which part are you talking about ? > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > which is what John is proposing (I think). I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch I'm assuming we want to have hash_type available to the progs? But also, check_csum handles other corner cases: - short_frame: we simply force all those small frames to skip checksum complete - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from IPv6 header - get_fixed_ipv4_csum: Although the stack expects checksum which doesn't include the pseudo header, the HW adds it So it doesn't look like we can just unconditionally use cqe->checksum? The driver does a lot of massaging around that field to make it palatable.
On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > Stanislav Fomichev wrote: > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > > >>>>>>> +{ > > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > > >>>>>>> + /* return true; */ > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > > >>>>>>> + } > > > > > > > >>>>>>> +} > > > > > > > >>>>>> > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > > >>>>> > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > > >>>>> point? > > > > > > > >>>> > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > > >>> > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > > >>> static_branch to avoid the fastpath cost? > > > > > > > >> > > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > > > >> > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > > >> the descriptor. 
For example in mlx getting the ts would be easy from > > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > > >> > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > > >> { > > > > > > > >> u64 hi, lo; > > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > > >> > > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > > >> > > > > > > > >> return hi | lo; > > > > > > > >> } > > > > > > > >> > > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > > >> > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > > >> u64 timestamp) > > > > > > > >> { > > > > > > > >> unsigned int seq; > > > > > > > >> u64 nsec; > > > > > > > >> > > > > > > > >> do { > > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > > >> > > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > > >> } > > > > > > > >> > > > > > > > >> I think the nsec is what you really want. > > > > > > > >> > > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > > >> > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > > >> and not BPF insns directly. > > > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > > u64 timestamp) > > > > > > > > { > > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > > } > > > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > > the metadata, that will end up having a significant performance impact > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > > which will collect several bits of metadata). > > > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > > before merging this. > > > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > > a use case for it now. > > > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > > rx side for a v2. 
> > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > > they have to keep that read in sync with the descriptors? Also need to > > > > handle versioning of descriptors where depending on specific options > > > > and firmware and chip being enabled the descriptor might be moving > > > > around. Of course can push this all to developer, but seems not so > > > > nice when we have the machinery to do this and we handle it for all > > > > other structures. > > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > > expect CO-RE sorts it out based on currently running btf_id of the > > > > descriptor. If you go through normal path you get this for free of > > > > course. > > > > > > Doesn't look like the descriptors are as nice as you're trying to > > > paint them (with clear hash/csum fields) :-) So not sure how much > > > CO-RE would help. > > > At least looking at mlx4 rx_csum, the driver consults three different > > > sets of flags to figure out the hash_type. Or am I just unlucky with > > > mlx4? > > > > Which part are you talking about ? > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > which is what John is proposing (I think). > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > I'm assuming we want to have hash_type available to the progs? > > But also, check_csum handles other corner cases: > - short_frame: we simply force all those small frames to skip checksum complete > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > IPv6 header > - get_fixed_ipv4_csum: Although the stack expects checksum which > doesn't include the pseudo header, the HW adds it Of course, but kfunc won't be doing them either. We're talking XDP fast path. The mlx4 hw is old and incapable. No amount of sw can help. > So it doesn't look like we can just unconditionally use cqe->checksum? > The driver does a lot of massaging around that field to make it > palatable. Of course we can. cqe->checksum is still usable. the bpf prog would need to know what it's reading. There should be no attempt to present a unified state of hw bits. That's what skb is for. XDP layer should not hide such hw details. Otherwise it will become a mini skb layer with all that overhead.
On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > > > Stanislav Fomichev wrote: > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > > > >>>>>>> +{ > > > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > > > >>>>>>> + /* return true; */ > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > > > >>>>>>> + } > > > > > > > > >>>>>>> +} > > > > > > > > >>>>>> > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > > > >>>>> > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > > > >>>>> point? > > > > > > > > >>>> > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > > > >>> > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > > > >>> static_branch to avoid the fastpath cost? > > > > > > > > >> > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > > > > >> > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > > > >> be roughly the same right? 
(difference being code embedded in driver vs > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from > > > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > > > >> > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > > > >> { > > > > > > > > >> u64 hi, lo; > > > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > > > >> > > > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > > > >> > > > > > > > > >> return hi | lo; > > > > > > > > >> } > > > > > > > > >> > > > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > > > >> > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > > > >> u64 timestamp) > > > > > > > > >> { > > > > > > > > >> unsigned int seq; > > > > > > > > >> u64 nsec; > > > > > > > > >> > > > > > > > > >> do { > > > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > > > >> > > > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > > > >> } > > > > > > > > >> > > > > > > > > >> I think the nsec is what you really want. > > > > > > > > >> > > > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > > > >> > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > > > >> and not BPF insns directly. > > > > > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > > > u64 timestamp) > > > > > > > > > { > > > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > > > } > > > > > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > > > the metadata, that will end up having a significant performance impact > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > > > which will collect several bits of metadata). > > > > > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > > > before merging this. 
> > > > > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > > > a use case for it now. > > > > > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > > > rx side for a v2. > > > > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > > > they have to keep that read in sync with the descriptors? Also need to > > > > > handle versioning of descriptors where depending on specific options > > > > > and firmware and chip being enabled the descriptor might be moving > > > > > around. Of course can push this all to developer, but seems not so > > > > > nice when we have the machinery to do this and we handle it for all > > > > > other structures. > > > > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > > > expect CO-RE sorts it out based on currently running btf_id of the > > > > > descriptor. If you go through normal path you get this for free of > > > > > course. > > > > > > > > Doesn't look like the descriptors are as nice as you're trying to > > > > paint them (with clear hash/csum fields) :-) So not sure how much > > > > CO-RE would help. > > > > At least looking at mlx4 rx_csum, the driver consults three different > > > > sets of flags to figure out the hash_type. Or am I just unlucky with > > > > mlx4? > > > > > > Which part are you talking about ? > > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > > which is what John is proposing (I think). > > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > > I'm assuming we want to have hash_type available to the progs? > > > > But also, check_csum handles other corner cases: > > - short_frame: we simply force all those small frames to skip checksum complete > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > > IPv6 header > > - get_fixed_ipv4_csum: Although the stack expects checksum which > > doesn't include the pseudo header, the HW adds it > > Of course, but kfunc won't be doing them either. > We're talking XDP fast path. > The mlx4 hw is old and incapable. > No amount of sw can help. > > > So it doesn't look like we can just unconditionally use cqe->checksum? > > The driver does a lot of massaging around that field to make it > > palatable. > > Of course we can. cqe->checksum is still usable. the bpf prog > would need to know what it's reading. > There should be no attempt to present a unified state of hw bits. > That's what skb is for. XDP layer should not hide such hw details. > Otherwise it will become a mini skb layer with all that overhead. I was hoping the kfunc could at least parse the flags and return some pkt_hash_types-like enum to indicate what this csum covers. So the users won't have to find the hardware manuals (not sure they are even available?) to decipher what numbers they've got. Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look that much different :-( But going back a bit: I'm probably missing what John has been suggesting. How is CO-RE relevant for kfuncs? 
kfuncs are already doing a CO-RE-like functionality by rewriting some "public api" (kfunc) into the bytecode to access the relevant data.
Stanislav Fomichev wrote: > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > Stanislav Fomichev wrote: > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > > > > >>>>>>> +{ > > > > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > > > > >>>>>>> + /* return true; */ > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > > > > >>>>>>> + } > > > > > > > > > >>>>>>> +} > > > > > > > > > >>>>>> > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > > > > >>>>> > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > > > > >>>>> point? > > > > > > > > > >>>> > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > > > > >>> > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > > > > >>> static_branch to avoid the fastpath cost? > > > > > > > > > >> > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call? 
> > > > > > > > > >> > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from > > > > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > > > > >> > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > > > > >> { > > > > > > > > > >> u64 hi, lo; > > > > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > > > > >> > > > > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > > > > >> > > > > > > > > > >> return hi | lo; > > > > > > > > > >> } > > > > > > > > > >> > > > > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > > > > >> > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > > > > >> u64 timestamp) > > > > > > > > > >> { > > > > > > > > > >> unsigned int seq; > > > > > > > > > >> u64 nsec; > > > > > > > > > >> > > > > > > > > > >> do { > > > > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > > > > >> > > > > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > > > > >> } > > > > > > > > > >> > > > > > > > > > >> I think the nsec is what you really want. > > > > > > > > > >> > > > > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > > > > >> > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > > > > >> and not BPF insns directly. > > > > > > > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > > > > u64 timestamp) > > > > > > > > > > { > > > > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > > > > the metadata, that will end up having a significant performance impact > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > > > > which will collect several bits of metadata). 
> > > > > > > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > > > > before merging this. > > > > > > > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > > > > a use case for it now. > > > > > > > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > > > > rx side for a v2. > > > > > > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > > > > they have to keep that read in sync with the descriptors? Also need to > > > > > > handle versioning of descriptors where depending on specific options > > > > > > and firmware and chip being enabled the descriptor might be moving > > > > > > around. Of course can push this all to developer, but seems not so > > > > > > nice when we have the machinery to do this and we handle it for all > > > > > > other structures. > > > > > > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > > > > expect CO-RE sorts it out based on currently running btf_id of the > > > > > > descriptor. If you go through normal path you get this for free of > > > > > > course. > > > > > > > > > > Doesn't look like the descriptors are as nice as you're trying to > > > > > paint them (with clear hash/csum fields) :-) So not sure how much > > > > > CO-RE would help. > > > > > At least looking at mlx4 rx_csum, the driver consults three different > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with > > > > > mlx4? > > > > > > > > Which part are you talking about ? > > > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > > > which is what John is proposing (I think). Yeah this is what I've been considering. If you just get the 'cqe' pointer walking the check_sum and rxhash should be straightforward. There seems to be a real difference between timestamps and most other fields IMO. Timestamps require some extra logic to turn into ns when using the NIC hw clock. And the hw clock requires some coordination to keep in sync and stop from overflowing and may be done through other protocols like PTP in my use case. In some cases I think the clock is also shared amongst multiple phys. Seems mlx has a seqlock_t to protect it and I'm less sure about this but seems intel nic maybe has a sideband control channel. Then there is everything else that I can think up that is per packet data and requires no coordination with the driver. Its just reading fields in the completion queue. This would be the csum, check_sum, vlan_header and so on. Sure we could kfunc each one of those things, but could also just write that directly in BPF and remove some abstractions and kernel dependency by doing it directly in the BPF program. If you like that abstraction seems to be the point of contention my opinion is the cost of kernel depency is high and I can abstract it with a user library anyways so burying it in the kernel only causes me support issues and backwards compat problems. Hopefully, my position is more clear. 
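The timestamp case described above is the one where a real call looks like the natural fit; a hedged sketch of such a driver-side helper, stitched together from the mlx4 snippets quoted earlier (the function name and its eventual exposure as a kfunc are assumptions):

    /* Sketch only: a non-unrolled kfunc candidate for mlx4 rx timestamps.
     * The seqlock-protected clock conversion is why this wants to remain a
     * real function call rather than emitted BPF instructions. */
    static u64 hypothetical_mlx4_xdp_rx_timestamp(struct mlx4_en_dev *mdev,
                                                  struct mlx4_cqe *cqe)
    {
            u64 cycles = mlx4_en_get_cqe_ts(cqe);
            unsigned int seq;
            u64 nsec;

            do {
                    seq = read_seqbegin(&mdev->clock_lock);
                    nsec = timecounter_cyc2time(&mdev->clock, cycles);
            } while (read_seqretry(&mdev->clock_lock, seq));

            return nsec;
    }

Per-packet fields like the raw checksum, by contrast, are single loads from the completion entry and could live either in a tiny unroll or in open-coded BPF as John describes.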
> > > > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > > > I'm assuming we want to have hash_type available to the progs? > > > > > > But also, check_csum handles other corner cases: > > > - short_frame: we simply force all those small frames to skip checksum complete > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > > > IPv6 header > > > - get_fixed_ipv4_csum: Although the stack expects checksum which > > > doesn't include the pseudo header, the HW adds it > > > > Of course, but kfunc won't be doing them either. > > We're talking XDP fast path. > > The mlx4 hw is old and incapable. > > No amount of sw can help. Doesn't this lend itself to letting the XDP BPF program write the BPF code to read it out. Maybe someone cares about these details for some cpumap thing, but the rest of us wont care we might just want to read check_csum. Maybe we have an IPv6 only network or IPv4 network so can make further shortcuts. If a driver dev does this they will be forced to do the cactch all version because they have no way to know such details. > > > > > So it doesn't look like we can just unconditionally use cqe->checksum? > > > The driver does a lot of massaging around that field to make it > > > palatable. > > > > Of course we can. cqe->checksum is still usable. the bpf prog > > would need to know what it's reading. > > There should be no attempt to present a unified state of hw bits. > > That's what skb is for. XDP layer should not hide such hw details. > > Otherwise it will become a mini skb layer with all that overhead. > > I was hoping the kfunc could at least parse the flags and return some > pkt_hash_types-like enum to indicate what this csum covers. > So the users won't have to find the hardware manuals (not sure they > are even available?) to decipher what numbers they've got. > Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look > that much different :-( The driver developers could still provide and ship the BPF libs with their drivers. I think if someone is going to use their NIC and lots of them and requires XDP it will get done. We could put them by the driver code mlx4.bpf or something. > > But going back a bit: I'm probably missing what John has been > suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing > a CO-RE-like functionality by rewriting some "public api" (kfunc) into > the bytecode to access the relevant data. This was maybe a bit of an aside. What I was pondering a bit out loud perhaps is my recollection is there may be a few different descriptor layouts depending features enabled, exact device loaded and such. So in this case if I was a driver writer I might not want to hardcode the offset of the check_sum field. If I could use CO-RE then I don't have to care if in one version is the Nth field and later on someone makes it the Mth field just like any normal kernel struct. But through the kfunc interface I couldn't see how to get that. So instead of having a bunch of kfunc implementations you could just have one for all your device classes because you always name the field the same thing.
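A sketch of the "name the field once, reuse across descriptor revisions" idea. It assumes the driver's descriptor type were exported through BTF under a stable name, which is itself an open question; the struct and field names below are placeholders (the same made-up descriptor as earlier, extended with a vlan_tag field). CO-RE relocates the offset per running kernel, and bpf_core_field_exists() covers revisions that lack the field entirely:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_endian.h>

    /* Placeholder local definition; offsets are relocated against the kernel
     * BTF, so the field may sit at a different offset in other revisions. */
    struct hypothetical_rx_desc {
            __be32 rss_hash;
            __be16 vlan_tag;
            __be16 checksum;
    } __attribute__((preserve_access_index));

    static __always_inline int
    read_rx_vlan(const struct hypothetical_rx_desc *desc, __u16 *vid)
    {
            if (!bpf_core_field_exists(desc->vlan_tag))
                    return -1; /* this descriptor revision has no vlan tag field */

            *vid = bpf_ntohs(BPF_CORE_READ(desc, vlan_tag));
            return 0;
    }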
Just a separate comment on this bit: John Fastabend <john.fastabend@gmail.com> writes: > If you go with in-kernel BPF kfunc approach (vs user space side) I think > you also need to add CO-RE to be friendly for driver developers? Otherwise > they have to keep that read in sync with the descriptors? CO-RE is for doing relocations of field offsets without having to recompile. That's not really relevant for the kernel, that gets recompiled whenever the layout changes. So the field offsets are just kept in sync with offsetof(), like in Stanislav's RFCv2 where he had this snippet: + /* return ((struct sk_buff *)r5)->tstamp; */ + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5, + offsetof(struct sk_buff, tstamp)), So I definitely don't think this is an argument against the kfunc approach? > Also need to handle versioning of descriptors where depending on > specific options and firmware and chip being enabled the descriptor > might be moving around. This is exactly the kind of thing the driver is supposed to take care of; it knows the hardware configuration and can pick the right descriptor format. Either by just picking an entirely different kfunc unroll depending on the config (if it's static), or by adding the right checks to the unroll. You'd have to replicate all this logic in BPF anyway, and while I'm not doubting *you* are capable of doing this, I don't think we should be forcing every XDP developer to deal with all this. Or to put it another way, a proper hardware abstraction and high-quality drivers is one of the main selling points of XDP over DPDK and other kernel bypass solutions; we should not jettison this when enabling metadata support! -Toke
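To make "picking an entirely different kfunc unroll depending on the config" concrete, a hedged sketch in the spirit of the veth unroll from this patch. The three-argument signature, the placeholder descriptor struct, and the convention that R1 carries the descriptor pointer are all assumptions, and NETIF_F_RXHASH stands in for whatever static configuration the driver would actually consult:

    /* Sketch only, not part of the series: the driver decides at unroll time
     * which instructions the program gets, so the BPF side never sees the
     * config knobs or descriptor variants. */
    static void hypothetical_unroll_rx_hash(const struct bpf_prog *prog,
                                            struct bpf_patch *patch,
                                            struct net_device *dev)
    {
            if (!(dev->features & NETIF_F_RXHASH)) {
                    /* hash disabled in this configuration: return 0; */
                    bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
                    return;
            }

            /* return ((struct hypothetical_rx_desc *)r1)->rss_hash;
             * assumes the unroll convention keeps the descriptor pointer in R1 */
            bpf_patch_append(patch,
                             BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
                                         offsetof(struct hypothetical_rx_desc, rss_hash)));
    }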
Stanislav Fomichev <sdf@google.com> writes: >> > Doesn't look like the descriptors are as nice as you're trying to >> > paint them (with clear hash/csum fields) :-) So not sure how much >> > CO-RE would help. >> > At least looking at mlx4 rx_csum, the driver consults three different >> > sets of flags to figure out the hash_type. Or am I just unlucky with >> > mlx4? >> >> Which part are you talking about ? >> hw_checksum = csum_unfold((__force __sum16)cqe->checksum); >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer >> which is what John is proposing (I think). > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > I'm assuming we want to have hash_type available to the progs? I agree we should expose the hash_type, but that doesn't actually look to be that complicated, see below. > But also, check_csum handles other corner cases: > - short_frame: we simply force all those small frames to skip checksum complete > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > IPv6 header > - get_fixed_ipv4_csum: Although the stack expects checksum which > doesn't include the pseudo header, the HW adds it > > So it doesn't look like we can just unconditionally use cqe->checksum? > The driver does a lot of massaging around that field to make it > palatable. Poking around a bit in the other drivers, AFAICT it's only a subset of drivers that support CSUM_COMPLETE at all; for instance, the Intel drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the CHECKSUM_UNNECESSARY is actually the most important bit we'd want to propagate? AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access to other data structures than the rx descriptor to determine the status of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so just exposing the rx descriptor to BPF as John is suggesting does not actually give the XDP program enough information to act on the checksum field on its own. We could still have a separate kfunc to just expose the hw checksum value (see below), but I think it probably needs to be paired with other kfuncs to be useful. Looking at the mlx4 code, I think the following mapping to kfuncs (in pseudo-C) would give the flexibility for XDP to access all the bits it needs, while inlining everything except getting the full checksum for non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other drivers (mlx5, ice, i40e) indicates that this would work for those drivers as well. 
bpf_xdp_metadata_rx_hash_supported() {
	return dev->features & NETIF_F_RXHASH;
}

bpf_xdp_metadata_rx_hash() {
	return be32_to_cpu(cqe->immed_rss_invalid);
}

bpf_xdp_metdata_rx_hash_type() {
	if (likely(dev->features & NETIF_F_RXCSUM) &&
	    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
	    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
	    cqe->checksum == cpu_to_be16(0xffff))
		return PKT_HASH_TYPE_L4;

	return PKT_HASH_TYPE_L3;
}

bpf_xdp_metadata_rx_csum_supported() {
	return dev->features & NETIF_F_RXCSUM;
}

bpf_xdp_metadata_rx_csum_level() {
	if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
				       MLX4_CQE_STATUS_UDP)) &&
	    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
	    cqe->checksum == cpu_to_be16(0xffff))
		return CHECKSUM_UNNECESSARY;

	if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
	      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
	    !short_frame(len))
		return CHECKSUM_COMPLETE; /* we could also omit this case entirely */

	return CHECKSUM_NONE;
}

/* this one could be called by the metadata_to_skb code */
bpf_xdp_metadata_rx_csum_full() {
	return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
}

/* this one would be for people like John who want to re-implement
 * check_csum() themselves */
bpf_xdp_metdata_rx_csum_raw() {
	return cqe->checksum;
}

-Toke
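For reference, a minimal sketch of how an XDP program might consume a couple of the kfuncs proposed above. The names mirror the pseudo-C and are not a merged kernel API; the assumption that each kfunc takes the XDP ctx as its only argument, and the exact return types, are guesses made only for the sake of the example.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfuncs, matching the pseudo-C proposal above. */
extern int bpf_xdp_metadata_rx_hash_supported(struct xdp_md *ctx) __ksym;
extern __u32 bpf_xdp_metadata_rx_hash(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_hash_example(struct xdp_md *ctx)
{
	__u32 hash = 0;

	if (bpf_xdp_metadata_rx_hash_supported(ctx))
		hash = bpf_xdp_metadata_rx_hash(ctx);

	/* e.g. steer on the hash; here we just pass packets that have one */
	return hash ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";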
On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Stanislav Fomichev <sdf@google.com> writes: > > >> > Doesn't look like the descriptors are as nice as you're trying to > >> > paint them (with clear hash/csum fields) :-) So not sure how much > >> > CO-RE would help. > >> > At least looking at mlx4 rx_csum, the driver consults three different > >> > sets of flags to figure out the hash_type. Or am I just unlucky with > >> > mlx4? > >> > >> Which part are you talking about ? > >> hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer > >> which is what John is proposing (I think). > > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > > I'm assuming we want to have hash_type available to the progs? > > I agree we should expose the hash_type, but that doesn't actually look > to be that complicated, see below. > > > But also, check_csum handles other corner cases: > > - short_frame: we simply force all those small frames to skip checksum complete > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > > IPv6 header > > - get_fixed_ipv4_csum: Although the stack expects checksum which > > doesn't include the pseudo header, the HW adds it > > > > So it doesn't look like we can just unconditionally use cqe->checksum? > > The driver does a lot of massaging around that field to make it > > palatable. > > Poking around a bit in the other drivers, AFAICT it's only a subset of > drivers that support CSUM_COMPLETE at all; for instance, the Intel > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to > propagate? > > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access > to other data structures than the rx descriptor to determine the status > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so > just exposing the rx descriptor to BPF as John is suggesting does not > actually give the XDP program enough information to act on the checksum > field on its own. We could still have a separate kfunc to just expose > the hw checksum value (see below), but I think it probably needs to be > paired with other kfuncs to be useful. > > Looking at the mlx4 code, I think the following mapping to kfuncs (in > pseudo-C) would give the flexibility for XDP to access all the bits it > needs, while inlining everything except getting the full checksum for > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other > drivers (mlx5, ice, i40e) indicates that this would work for those > drivers as well. 
> > > bpf_xdp_metadata_rx_hash_supported() { > return dev->features & NETIF_F_RXHASH; > } > > bpf_xdp_metadata_rx_hash() { > return be32_to_cpu(cqe->immed_rss_invalid); > } > > bpf_xdp_metdata_rx_hash_type() { > if (likely(dev->features & NETIF_F_RXCSUM) && > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) && > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && > cqe->checksum == cpu_to_be16(0xffff)) > return PKT_HASH_TYPE_L4; > > return PKT_HASH_TYPE_L3; > } > > bpf_xdp_metadata_rx_csum_supported() { > return dev->features & NETIF_F_RXCSUM; > } > > bpf_xdp_metadata_rx_csum_level() { > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | > MLX4_CQE_STATUS_UDP)) && > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && > cqe->checksum == cpu_to_be16(0xffff)) > return CHECKSUM_UNNECESSARY; > > if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP && > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) && > !short_frame(len)) > return CHECKSUM_COMPLETE; /* we could also omit this case entirely */ > > return CHECKSUM_NONE; > } > > /* this one could be called by the metadata_to_skb code */ > bpf_xdp_metadata_rx_csum_full() { > return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */ > } > > /* this one would be for people like John who want to re-implement > * check_csum() themselves */ > bpf_xdp_metdata_rx_csum_raw() { > return cqe->checksum; > } Are you proposing a bunch of per-driver kfuncs that bpf prog will call. If so that works, but bpf prog needs to pass dev and cqe pointers into these kfuncs, so they need to be exposed to the prog somehow. Probably through xdp_md ? This way we can have both: bpf prog reading cqe fields directly and using kfuncs to access things. Inlining of kfuncs should be done generically. It's not a driver job to convert native asm into bpf asm.
On Wed, Nov 16, 2022 at 10:55 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Stanislav Fomichev wrote: > > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > Stanislav Fomichev wrote: > > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > > > > > >>>>>>> +{ > > > > > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > > > > > >>>>>>> + /* return true; */ > > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > > > > > >>>>>>> + } > > > > > > > > > > >>>>>>> +} > > > > > > > > > > >>>>>> > > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > > > > > >>>>> > > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > > > > > >>>>> point? > > > > > > > > > > >>>> > > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > > > > > >>> > > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > > > > > >>> static_branch to avoid the fastpath cost? 
> > > > > > > > > > >> > > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > > > > > > >> > > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from > > > > > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > > > > > >> > > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > > > > > >> { > > > > > > > > > > >> u64 hi, lo; > > > > > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > > > > > >> > > > > > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > > > > > >> > > > > > > > > > > >> return hi | lo; > > > > > > > > > > >> } > > > > > > > > > > >> > > > > > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > > > > > >> > > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > > > > > >> u64 timestamp) > > > > > > > > > > >> { > > > > > > > > > > >> unsigned int seq; > > > > > > > > > > >> u64 nsec; > > > > > > > > > > >> > > > > > > > > > > >> do { > > > > > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > > > > > >> > > > > > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > > > > > >> } > > > > > > > > > > >> > > > > > > > > > > >> I think the nsec is what you really want. > > > > > > > > > > >> > > > > > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > > > > > >> > > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > > > > > >> and not BPF insns directly. 
> > > > > > > > > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > > > > > u64 timestamp) > > > > > > > > > > > { > > > > > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > > > > > the metadata, that will end up having a significant performance impact > > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > > > > > which will collect several bits of metadata). > > > > > > > > > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > > > > > before merging this. > > > > > > > > > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > > > > > a use case for it now. > > > > > > > > > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > > > > > rx side for a v2. > > > > > > > > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > > > > > they have to keep that read in sync with the descriptors? Also need to > > > > > > > handle versioning of descriptors where depending on specific options > > > > > > > and firmware and chip being enabled the descriptor might be moving > > > > > > > around. Of course can push this all to developer, but seems not so > > > > > > > nice when we have the machinery to do this and we handle it for all > > > > > > > other structures. > > > > > > > > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > > > > > expect CO-RE sorts it out based on currently running btf_id of the > > > > > > > descriptor. If you go through normal path you get this for free of > > > > > > > course. > > > > > > > > > > > > Doesn't look like the descriptors are as nice as you're trying to > > > > > > paint them (with clear hash/csum fields) :-) So not sure how much > > > > > > CO-RE would help. > > > > > > At least looking at mlx4 rx_csum, the driver consults three different > > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with > > > > > > mlx4? > > > > > > > > > > Which part are you talking about ? > > > > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > > > > which is what John is proposing (I think). > > Yeah this is what I've been considering. If you just get the 'cqe' pointer > walking the check_sum and rxhash should be straightforward. > > There seems to be a real difference between timestamps and most other > fields IMO. 
Timestamps require some extra logic to turn into ns when > using the NIC hw clock. And the hw clock requires some coordination to > keep in sync and stop from overflowing and may be done through other > protocols like PTP in my use case. In some cases I think the clock is > also shared amongst multiple phys. Seems mlx has a seqlock_t to protect > it and I'm less sure about this but seems intel nic maybe has a sideband > control channel. > > Then there is everything else that I can think up that is per packet data > and requires no coordination with the driver. Its just reading fields in > the completion queue. This would be the csum, check_sum, vlan_header and > so on. Sure we could kfunc each one of those things, but could also just > write that directly in BPF and remove some abstractions and kernel > dependency by doing it directly in the BPF program. If you like that > abstraction seems to be the point of contention my opinion is the cost > of kernel depency is high and I can abstract it with a user library > anyways so burying it in the kernel only causes me support issues and > backwards compat problems. > > Hopefully, my position is more clear. Yeah, I see your point, I'm somewhat in the same position here wrt to legacy kernels :-) Exposing raw descriptors seems fine, but imo it shouldn't be the goto mechanism for the metadata; but rather as a fallback whenever the driver implementation is missing/buggy. Unless, as you mention below, there are some libraries in the future to abstract that. But at least it seems that we agree that there needs to be some other non-raw-descriptor way to access spinlocked things like the timestamp? > > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > > > > I'm assuming we want to have hash_type available to the progs? > > > > > > > > But also, check_csum handles other corner cases: > > > > - short_frame: we simply force all those small frames to skip checksum complete > > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > > > > IPv6 header > > > > - get_fixed_ipv4_csum: Although the stack expects checksum which > > > > doesn't include the pseudo header, the HW adds it > > > > > > Of course, but kfunc won't be doing them either. > > > We're talking XDP fast path. > > > The mlx4 hw is old and incapable. > > > No amount of sw can help. > > Doesn't this lend itself to letting the XDP BPF program write the > BPF code to read it out. Maybe someone cares about these details > for some cpumap thing, but the rest of us wont care we might just > want to read check_csum. Maybe we have an IPv6 only network or > IPv4 network so can make further shortcuts. If a driver dev does > this they will be forced to do the cactch all version because > they have no way to know such details. > > > > > So it doesn't look like we can just unconditionally use cqe->checksum? > > > > The driver does a lot of massaging around that field to make it > > > > palatable. > > > > > > Of course we can. cqe->checksum is still usable. the bpf prog > > > would need to know what it's reading. > > > There should be no attempt to present a unified state of hw bits. > > > That's what skb is for. XDP layer should not hide such hw details. > > > Otherwise it will become a mini skb layer with all that overhead. > > > > I was hoping the kfunc could at least parse the flags and return some > > pkt_hash_types-like enum to indicate what this csum covers. 
> > So the users won't have to find the hardware manuals (not sure they > > are even available?) to decipher what numbers they've got. > > Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look > > that much different :-( > > The driver developers could still provide and ship the BPF libs > with their drivers. I think if someone is going to use their NIC > and lots of them and requires XDP it will get done. We could put > them by the driver code mlx4.bpf or something. > > > > > But going back a bit: I'm probably missing what John has been > > suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing > > a CO-RE-like functionality by rewriting some "public api" (kfunc) into > > the bytecode to access the relevant data. > > This was maybe a bit of an aside. What I was pondering a bit out > loud perhaps is my recollection is there may be a few different > descriptor layouts depending features enabled, exact device loaded > and such. So in this case if I was a driver writer I might not want > to hardcode the offset of the check_sum field. If I could use CO-RE > then I don't have to care if in one version is the Nth field and later on > someone makes it the Mth field just like any normal kernel struct. > But through the kfunc interface I couldn't see how to get that. > So instead of having a bunch of kfunc implementations you could just > have one for all your device classes because you always name the > field the same thing.
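As a concrete illustration of the CO-RE idea John describes, the fragment below shows a relocatable read of the mlx4 completion-queue checksum field from BPF. How the program actually gets hold of the cqe pointer (a ctx cast, a kfunc, or something else) is exactly the open question in this thread, so that part is left as a function argument; the helper name is made up for the example.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

/* Fragment only: assumes the caller has somehow obtained a struct mlx4_cqe *. */
static __always_inline __u16 read_raw_csum(struct mlx4_cqe *cqe)
{
	__be16 csum;

	/* CO-RE relocation: the offset of 'checksum' is resolved against the
	 * running kernel's BTF at load time, so the program does not need to
	 * be recompiled if the descriptor layout moves around.
	 */
	csum = BPF_CORE_READ(cqe, checksum);
	return bpf_ntohs(csum);
}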
On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > > > Stanislav Fomichev <sdf@google.com> writes: > > > > >> > Doesn't look like the descriptors are as nice as you're trying to > > >> > paint them (with clear hash/csum fields) :-) So not sure how much > > >> > CO-RE would help. > > >> > At least looking at mlx4 rx_csum, the driver consults three different > > >> > sets of flags to figure out the hash_type. Or am I just unlucky with > > >> > mlx4? > > >> > > >> Which part are you talking about ? > > >> hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > >> which is what John is proposing (I think). > > > > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch > > > I'm assuming we want to have hash_type available to the progs? > > > > I agree we should expose the hash_type, but that doesn't actually look > > to be that complicated, see below. > > > > > But also, check_csum handles other corner cases: > > > - short_frame: we simply force all those small frames to skip checksum complete > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from > > > IPv6 header > > > - get_fixed_ipv4_csum: Although the stack expects checksum which > > > doesn't include the pseudo header, the HW adds it > > > > > > So it doesn't look like we can just unconditionally use cqe->checksum? > > > The driver does a lot of massaging around that field to make it > > > palatable. > > > > Poking around a bit in the other drivers, AFAICT it's only a subset of > > drivers that support CSUM_COMPLETE at all; for instance, the Intel > > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the > > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to > > propagate? > > > > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access > > to other data structures than the rx descriptor to determine the status > > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so > > just exposing the rx descriptor to BPF as John is suggesting does not > > actually give the XDP program enough information to act on the checksum > > field on its own. We could still have a separate kfunc to just expose > > the hw checksum value (see below), but I think it probably needs to be > > paired with other kfuncs to be useful. > > > > Looking at the mlx4 code, I think the following mapping to kfuncs (in > > pseudo-C) would give the flexibility for XDP to access all the bits it > > needs, while inlining everything except getting the full checksum for > > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other > > drivers (mlx5, ice, i40e) indicates that this would work for those > > drivers as well. 
> > > > > > bpf_xdp_metadata_rx_hash_supported() { > > return dev->features & NETIF_F_RXHASH; > > } > > > > bpf_xdp_metadata_rx_hash() { > > return be32_to_cpu(cqe->immed_rss_invalid); > > } > > > > bpf_xdp_metdata_rx_hash_type() { > > if (likely(dev->features & NETIF_F_RXCSUM) && > > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) && > > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && > > cqe->checksum == cpu_to_be16(0xffff)) > > return PKT_HASH_TYPE_L4; > > > > return PKT_HASH_TYPE_L3; > > } > > > > bpf_xdp_metadata_rx_csum_supported() { > > return dev->features & NETIF_F_RXCSUM; > > } > > > > bpf_xdp_metadata_rx_csum_level() { > > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | > > MLX4_CQE_STATUS_UDP)) && > > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && > > cqe->checksum == cpu_to_be16(0xffff)) > > return CHECKSUM_UNNECESSARY; > > > > if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP && > > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) && > > !short_frame(len)) > > return CHECKSUM_COMPLETE; /* we could also omit this case entirely */ > > > > return CHECKSUM_NONE; > > } > > > > /* this one could be called by the metadata_to_skb code */ > > bpf_xdp_metadata_rx_csum_full() { > > return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */ > > } > > > > /* this one would be for people like John who want to re-implement > > * check_csum() themselves */ > > bpf_xdp_metdata_rx_csum_raw() { > > return cqe->checksum; > > } > > Are you proposing a bunch of per-driver kfuncs that bpf prog will call. > If so that works, but bpf prog needs to pass dev and cqe pointers > into these kfuncs, so they need to be exposed to the prog somehow. > Probably through xdp_md ? So far I'm doing: struct mlx4_xdp_buff { struct xdp_buff xdp; struct mlx4_cqe *cqe; struct mlx4_en_dev *mdev; } And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can find cqe/mdev via container_of. If we really need these to be exposed to the program, can we use Yonghong's approach from [0]? 0: https://lore.kernel.org/bpf/20221114162328.622665-1-yhs@fb.com/ > This way we can have both: bpf prog reading cqe fields directly > and using kfuncs to access things. > Inlining of kfuncs should be done generically. > It's not a driver job to convert native asm into bpf asm. Ack. I can replace the unrolling with something that just resolves "generic" kfuncs to the per-driver implementation maybe? That would at least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
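A rough kernel-side sketch of the container_of() pattern described above; the function name and the choice of immed_rss_invalid as the hash source are illustrative and not taken from a merged implementation.

/* Driver-internal wrapper around the xdp_buff, as described above. */
struct mlx4_xdp_buff {
	struct xdp_buff xdp;
	struct mlx4_cqe *cqe;
	struct mlx4_en_dev *mdev;
};

/* Hypothetical mlx4 implementation of an rx-hash kfunc: ctx is the
 * xdp_buff the program sees, and the driver recovers its private state
 * via container_of() because xdp is the first member.
 */
static u32 mlx4_xdp_rx_hash(struct xdp_buff *xdp)
{
	struct mlx4_xdp_buff *mxbuf =
		container_of(xdp, struct mlx4_xdp_buff, xdp);

	return be32_to_cpu(mxbuf->cqe->immed_rss_invalid);
}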
Stanislav Fomichev wrote: > On Wed, Nov 16, 2022 at 10:55 PM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Stanislav Fomichev wrote: > > > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote: > > > > > > > > > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > Stanislav Fomichev wrote: > > > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend > > > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > Toke Høiland-Jørgensen wrote: > > > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes: > > > > > > > > > > > > > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote: > > > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id, > > > > > > > > > > > >>>>>>> + struct bpf_patch *patch) > > > > > > > > > > > >>>>>>> +{ > > > > > > > > > > > >>>>>>> + if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) { > > > > > > > > > > > >>>>>>> + /* return true; */ > > > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1)); > > > > > > > > > > > >>>>>>> + } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) { > > > > > > > > > > > >>>>>>> + /* return ktime_get_mono_fast_ns(); */ > > > > > > > > > > > >>>>>>> + bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns)); > > > > > > > > > > > >>>>>>> + } > > > > > > > > > > > >>>>>>> +} > > > > > > > > > > > >>>>>> > > > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples > > > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function > > > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before). > > > > > > > > > > > >>>>> > > > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support? > > > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more > > > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some > > > > > > > > > > > >>>>> point? > > > > > > > > > > > >>>> > > > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the > > > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of > > > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we > > > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :) > > > > > > > > > > > >>> > > > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge > > > > > > > > > > > >>> whether that complexity is manageable or not :-) > > > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users? > > > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper > > > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some > > > > > > > > > > > >>> static_branch to avoid the fastpath cost? 
> > > > > > > > > > > >> > > > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some > > > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel > > > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call? > > > > > > > > > > > >> > > > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would > > > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs > > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs > > > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from > > > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from > > > > > > > > > > > >> BPF with the mlx4_cqe struct exposed > > > > > > > > > > > >> > > > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe) > > > > > > > > > > > >> { > > > > > > > > > > > >> u64 hi, lo; > > > > > > > > > > > >> struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe; > > > > > > > > > > > >> > > > > > > > > > > > >> lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo); > > > > > > > > > > > >> hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16; > > > > > > > > > > > >> > > > > > > > > > > > >> return hi | lo; > > > > > > > > > > > >> } > > > > > > > > > > > >> > > > > > > > > > > > >> but converting that to nsec is a bit annoying, > > > > > > > > > > > >> > > > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev, > > > > > > > > > > > >> struct skb_shared_hwtstamps *hwts, > > > > > > > > > > > >> u64 timestamp) > > > > > > > > > > > >> { > > > > > > > > > > > >> unsigned int seq; > > > > > > > > > > > >> u64 nsec; > > > > > > > > > > > >> > > > > > > > > > > > >> do { > > > > > > > > > > > >> seq = read_seqbegin(&mdev->clock_lock); > > > > > > > > > > > >> nsec = timecounter_cyc2time(&mdev->clock, timestamp); > > > > > > > > > > > >> } while (read_seqretry(&mdev->clock_lock, seq)); > > > > > > > > > > > >> > > > > > > > > > > > >> memset(hwts, 0, sizeof(struct skb_shared_hwtstamps)); > > > > > > > > > > > >> hwts->hwtstamp = ns_to_ktime(nsec); > > > > > > > > > > > >> } > > > > > > > > > > > >> > > > > > > > > > > > >> I think the nsec is what you really want. > > > > > > > > > > > >> > > > > > > > > > > > >> With all the drivers doing slightly different ops we would have > > > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get > > > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some > > > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed > > > > > > > > > > > >> on ice driver though to get rx tstamps on all packets. > > > > > > > > > > > >> > > > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL > > > > > > > > > > > >> and not BPF insns directly. 
> > > > > > > > > > > > > > > > > > > > > > > > Some of the mlx5 path looks like this: > > > > > > > > > > > > > > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low)) > > > > > > > > > > > > > > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock, > > > > > > > > > > > > u64 timestamp) > > > > > > > > > > > > { > > > > > > > > > > > > u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF); > > > > > > > > > > > > > > > > > > > > > > > > return ns_to_ktime(time); > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better. > > > > > > > > > > > > > > > > > > > > > > Sure, but if we end up having a full function call for every field in > > > > > > > > > > > the metadata, that will end up having a significant performance impact > > > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here, > > > > > > > > > > > which will collect several bits of metadata). > > > > > > > > > > > > > > > > > > > > > > > csum may have a better chance to inline? > > > > > > > > > > > > > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this > > > > > > > > > > > series against Jesper's; which I think we should definitely be doing > > > > > > > > > > > before merging this. > > > > > > > > > > > > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have > > > > > > > > > > a use case for it now. > > > > > > > > > > > > > > > > > > > > Also hash is often sitting in the rx descriptor. > > > > > > > > > > > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the > > > > > > > > > rx side for a v2. > > > > > > > > > > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think > > > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise > > > > > > > > they have to keep that read in sync with the descriptors? Also need to > > > > > > > > handle versioning of descriptors where depending on specific options > > > > > > > > and firmware and chip being enabled the descriptor might be moving > > > > > > > > around. Of course can push this all to developer, but seems not so > > > > > > > > nice when we have the machinery to do this and we handle it for all > > > > > > > > other structures. > > > > > > > > > > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and > > > > > > > > expect CO-RE sorts it out based on currently running btf_id of the > > > > > > > > descriptor. If you go through normal path you get this for free of > > > > > > > > course. > > > > > > > > > > > > > > Doesn't look like the descriptors are as nice as you're trying to > > > > > > > paint them (with clear hash/csum fields) :-) So not sure how much > > > > > > > CO-RE would help. > > > > > > > At least looking at mlx4 rx_csum, the driver consults three different > > > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with > > > > > > > mlx4? > > > > > > > > > > > > Which part are you talking about ? > > > > > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum); > > > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer > > > > > > which is what John is proposing (I think). > > > > Yeah this is what I've been considering. If you just get the 'cqe' pointer > > walking the check_sum and rxhash should be straightforward. 
> > > > There seems to be a real difference between timestamps and most other > > fields IMO. Timestamps require some extra logic to turn into ns when > > using the NIC hw clock. And the hw clock requires some coordination to > > keep in sync and stop from overflowing and may be done through other > > protocols like PTP in my use case. In some cases I think the clock is > > also shared amongst multiple phys. Seems mlx has a seqlock_t to protect > > it and I'm less sure about this but seems intel nic maybe has a sideband > > control channel. > > > > Then there is everything else that I can think up that is per packet data > > and requires no coordination with the driver. Its just reading fields in > > the completion queue. This would be the csum, check_sum, vlan_header and > > so on. Sure we could kfunc each one of those things, but could also just > > write that directly in BPF and remove some abstractions and kernel > > dependency by doing it directly in the BPF program. If you like that > > abstraction seems to be the point of contention my opinion is the cost > > of kernel depency is high and I can abstract it with a user library > > anyways so burying it in the kernel only causes me support issues and > > backwards compat problems. > > > > Hopefully, my position is more clear. > > Yeah, I see your point, I'm somewhat in the same position here wrt to > legacy kernels :-) > Exposing raw descriptors seems fine, but imo it shouldn't be the goto > mechanism for the metadata; but rather as a fallback whenever the > driver implementation is missing/buggy. Unless, as you mention below, > there are some libraries in the future to abstract that. > But at least it seems that we agree that there needs to be some other > non-raw-descriptor way to access spinlocked things like the timestamp? > Yeah for timestamps I think a kfunc to either get the timestamp or could also be done with a kfunc to read hw clock. But either way seems hard to do that in BPF code directly so kfunc feels right to me here. By the way I think if you have the completion queue (rx descriptor) in the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type then we should be able to also directly read all the fields. I see you noted this in the response to Alexei so lets see what he thinks.
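For the timestamp case John describes, a driver-side kfunc could look roughly like the sketch below, stitched together from the mlx4_en_get_cqe_ts() and mlx4_en_fill_hwtstamps() code quoted earlier in the thread and reusing the hypothetical mlx4_xdp_buff/container_of() layout sketched above; the function name is made up and this is not from a merged implementation.

static u64 mlx4_xdp_rx_timestamp(struct xdp_buff *xdp)
{
	struct mlx4_xdp_buff *mxbuf =
		container_of(xdp, struct mlx4_xdp_buff, xdp);
	u64 cycles = mlx4_en_get_cqe_ts(mxbuf->cqe);
	unsigned int seq;
	u64 nsec;

	/* Same seqlock retry loop as mlx4_en_fill_hwtstamps(); this locking
	 * is the part that is awkward to open-code in a BPF program and why
	 * a kfunc (rather than a raw descriptor read) fits here.
	 */
	do {
		seq = read_seqbegin(&mxbuf->mdev->clock_lock);
		nsec = timecounter_cyc2time(&mxbuf->mdev->clock, cycles);
	} while (read_seqretry(&mxbuf->mdev->clock_lock, seq));

	return nsec;
}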
On Thu, Nov 17, 2022 at 11:47 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Yeah for timestamps I think a kfunc to either get the timestamp or could
> also be done with a kfunc to read hw clock. But either way seems hard
> to do that in BPF code directly so kfunc feels right to me here.
>
> By the way I think if you have the completion queue (rx descriptor) in
> the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type
> then we should be able to also directly read all the fields. I see
> you noted this in the response to Alexei so lets see what he thinks.

Fine with me. Let's land something that is not uapi and then iterate on
top.
Stanislav Fomichev <sdf@google.com> writes: > On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> > >> > Stanislav Fomichev <sdf@google.com> writes: >> > >> > >> > Doesn't look like the descriptors are as nice as you're trying to >> > >> > paint them (with clear hash/csum fields) :-) So not sure how much >> > >> > CO-RE would help. >> > >> > At least looking at mlx4 rx_csum, the driver consults three different >> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with >> > >> > mlx4? >> > >> >> > >> Which part are you talking about ? >> > >> hw_checksum = csum_unfold((__force __sum16)cqe->checksum); >> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer >> > >> which is what John is proposing (I think). >> > > >> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. >> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch >> > > I'm assuming we want to have hash_type available to the progs? >> > >> > I agree we should expose the hash_type, but that doesn't actually look >> > to be that complicated, see below. >> > >> > > But also, check_csum handles other corner cases: >> > > - short_frame: we simply force all those small frames to skip checksum complete >> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from >> > > IPv6 header >> > > - get_fixed_ipv4_csum: Although the stack expects checksum which >> > > doesn't include the pseudo header, the HW adds it >> > > >> > > So it doesn't look like we can just unconditionally use cqe->checksum? >> > > The driver does a lot of massaging around that field to make it >> > > palatable. >> > >> > Poking around a bit in the other drivers, AFAICT it's only a subset of >> > drivers that support CSUM_COMPLETE at all; for instance, the Intel >> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the >> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to >> > propagate? >> > >> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access >> > to other data structures than the rx descriptor to determine the status >> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so >> > just exposing the rx descriptor to BPF as John is suggesting does not >> > actually give the XDP program enough information to act on the checksum >> > field on its own. We could still have a separate kfunc to just expose >> > the hw checksum value (see below), but I think it probably needs to be >> > paired with other kfuncs to be useful. >> > >> > Looking at the mlx4 code, I think the following mapping to kfuncs (in >> > pseudo-C) would give the flexibility for XDP to access all the bits it >> > needs, while inlining everything except getting the full checksum for >> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other >> > drivers (mlx5, ice, i40e) indicates that this would work for those >> > drivers as well. 
>> > >> > >> > bpf_xdp_metadata_rx_hash_supported() { >> > return dev->features & NETIF_F_RXHASH; >> > } >> > >> > bpf_xdp_metadata_rx_hash() { >> > return be32_to_cpu(cqe->immed_rss_invalid); >> > } >> > >> > bpf_xdp_metdata_rx_hash_type() { >> > if (likely(dev->features & NETIF_F_RXCSUM) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum == cpu_to_be16(0xffff)) >> > return PKT_HASH_TYPE_L4; >> > >> > return PKT_HASH_TYPE_L3; >> > } >> > >> > bpf_xdp_metadata_rx_csum_supported() { >> > return dev->features & NETIF_F_RXCSUM; >> > } >> > >> > bpf_xdp_metadata_rx_csum_level() { >> > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | >> > MLX4_CQE_STATUS_UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum == cpu_to_be16(0xffff)) >> > return CHECKSUM_UNNECESSARY; >> > >> > if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) && >> > !short_frame(len)) >> > return CHECKSUM_COMPLETE; /* we could also omit this case entirely */ >> > >> > return CHECKSUM_NONE; >> > } >> > >> > /* this one could be called by the metadata_to_skb code */ >> > bpf_xdp_metadata_rx_csum_full() { >> > return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */ >> > } >> > >> > /* this one would be for people like John who want to re-implement >> > * check_csum() themselves */ >> > bpf_xdp_metdata_rx_csum_raw() { >> > return cqe->checksum; >> > } >> >> Are you proposing a bunch of per-driver kfuncs that bpf prog will call. >> If so that works, but bpf prog needs to pass dev and cqe pointers >> into these kfuncs, so they need to be exposed to the prog somehow. >> Probably through xdp_md ? No, I didn't mean we should call per-driver kfuncs; the examples above were meant to be examples of what the mlx4 driver would unrolls those kfuncs to. Sorry that that wasn't clear. > So far I'm doing: > > struct mlx4_xdp_buff { > struct xdp_buff xdp; > struct mlx4_cqe *cqe; > struct mlx4_en_dev *mdev; > } > > And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can > find cqe/mdev via container_of. > > If we really need these to be exposed to the program, can we use > Yonghong's approach from [0]? I don't think we should expose them to the BPF program; I like your approach of stuffing them next to the CTX pointer and de-referencing that. This makes it up to the driver which extra objects it needs, and the caller doesn't have to know/care. I'm not vehemently opposed to *also* having the rx-desc pointer directly accessible (in which case Yonghong's kfunc approach is probably fine). However, as mentioned in my previous email, I doubt how useful that descriptor itself will be... >> This way we can have both: bpf prog reading cqe fields directly >> and using kfuncs to access things. >> Inlining of kfuncs should be done generically. >> It's not a driver job to convert native asm into bpf asm. > > Ack. I can replace the unrolling with something that just resolves > "generic" kfuncs to the per-driver implementation maybe? That would at > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime.. As stated above, I think we should keep the unrolling. If we end up with an actual CALL instruction for every piece of metadata that's going to suck performance-wise; unrolling is how we keep this fast enough! :) -Toke
On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> >
> > Ack. I can replace the unrolling with something that just resolves
> > "generic" kfuncs to the per-driver implementation maybe? That would at
> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>
> As stated above, I think we should keep the unrolling. If we end up with
> an actual CALL instruction for every piece of metadata that's going to
> suck performance-wise; unrolling is how we keep this fast enough! :)

Let's start with pure kfuncs without requiring drivers
to provide corresponding bpf asm.
If pure kfuncs will indeed turn out to be perf limiting
then we'll decide on how to optimize them.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> >
>> > Ack. I can replace the unrolling with something that just resolves
>> > "generic" kfuncs to the per-driver implementation maybe? That would at
>> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>>
>> As stated above, I think we should keep the unrolling. If we end up with
>> an actual CALL instruction for every piece of metadata that's going to
>> suck performance-wise; unrolling is how we keep this fast enough! :)
>
> Let's start with pure kfuncs without requiring drivers
> to provide corresponding bpf asm.
> If pure kfuncs will indeed turn out to be perf limiting
> then we'll decide on how to optimize them.

I'm ~90% sure we'll need that optimisation, but OK, we can start from a
baseline of having them be function calls...

-Toke
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2a4592780141..c626580a2294 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -25,6 +25,7 @@
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/net_tstamp.h>
 
 #define DRV_NAME "veth"
@@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+			      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		/* return ktime_get_mono_fast_ns(); */
+		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1678,6 +1691,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_unroll_kfunc	= veth_unroll_kfunc,
 };
 
 #define VETH_FEATURES	(NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
The goal is to enable end-to-end testing of the metadata for AF_XDP.
Current rx_timestamp kfunc returns current time which should be enough
to exercise this new functionality.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)