
Does skb_metadata_differs really need to stop GRO aggregation?

Message ID 92a355bd-7105-4a17-9543-ba2d8ae36a37@kernel.org (mailing list archive)
State RFC
Delegated to: BPF
Series Does skb_metadata_differs really need to stop GRO aggregation?

Checks

netdev/tree_selection: success (Guessing tree name failed - patch did not apply)

Commit Message

Jesper Dangaard Brouer Nov. 28, 2023, 12:37 p.m. UTC
Hi Daniel,

I'm trying to understand why skb_metadata_differs() needed to block GRO ?

I was looking at XDP storing information in metadata area that also
survives into SKBs layer.  E.g. the RX timestamp.

Then I noticed that GRO code (gro_list_prepare) will not allow
aggregating if metadata isn't the same in all packets via
skb_metadata_differs().  Is this really needed?
Can we lift/remove this limitation?

E.g. if I want to store a timestamp, then it will differ per packet.
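
For concreteness, a minimal sketch of the kind of XDP prog I mean (not
our actual code, just stashing an 8-byte timestamp in the metadata area
in front of the packet):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_store_rx_ts(struct xdp_md *ctx)
{
	__u64 *ts;

	/* grow the metadata area by 8 bytes (fails on drivers without
	 * metadata support, then we just pass the packet through)
	 */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*ts)))
		return XDP_PASS;

	ts = (void *)(long)ctx->data_meta;
	if ((void *)(ts + 1) > (void *)(long)ctx->data)
		return XDP_PASS;	/* bounds check for the verifier */

	*ts = bpf_ktime_get_ns();	/* per-packet value */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";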

--Jesper

Git history says it dates back to the original commit that added meta
pointer de8f3a83b0a0 ("bpf: add meta pointer for direct access") (author
Daniel).

Comments

Jesper Dangaard Brouer Nov. 28, 2023, 1:06 p.m. UTC | #1
On 11/28/23 13:37, Jesper Dangaard Brouer wrote:
> Hi Daniel,
> 
> I'm trying to understand why skb_metadata_differs() needed to block GRO ?
> 
> I was looking at XDP storing information in metadata area that also
> survives into SKBs layer.  E.g. the RX timestamp.
> 
> Then I noticed that GRO code (gro_list_prepare) will not allow
> aggregating if metadata isn't the same in all packets via
> skb_metadata_differs().  Is this really needed?
> Can we lift/remove this limitation?
> 

(Answering myself)
I understand/see now that when an SKB gets GRO-aggregated, I will
"lose" access to the metadata information and only have access to the
metadata in the "first" SKB.
Thus, the GRO layer still needs this check, as it cannot know whether
the info was important or not.
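
(For reference, the check itself is simple: it compares the metadata
length and then memcmp()s the metadata bytes sitting right in front of
the MAC header. Simplified sketch, see include/linux/skbuff.h for the
real code:)

static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
					const struct sk_buff *skb_b)
{
	u8 len_a = skb_metadata_len(skb_a);
	u8 len_b = skb_metadata_len(skb_b);

	if (!(len_a | len_b))
		return false;		/* neither skb carries metadata */

	if (len_a != len_b)
		return true;		/* different sizes always differ */

	return __skb_metadata_differs(skb_a, skb_b, len_a); /* byte compare */
}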

I wonder if there is a BPF hook, prior to GRO step, that could allow me
to extract variable metadata and zero it out before GRO step.


> E.g. if I want to store a timestamp, then it will differ per packet.
> 
> --Jesper
> 
> Git history says it dates back to the original commit that added meta
> pointer de8f3a83b0a0 ("bpf: add meta pointer for direct access") (author
> Daniel).
> 
> 
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 0759277dc14e..7fb6a6a24288 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -341,7 +341,7 @@ static void gro_list_prepare(const struct list_head *head,
> 
>                  diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
>                  diffs |= p->vlan_all ^ skb->vlan_all;
> -               diffs |= skb_metadata_differs(p, skb);
> +               diffs |= skb_metadata_differs(p, skb); // Why?
>                  if (maclen == ETH_HLEN)
>                          diffs |= compare_ether_header(skb_mac_header(p),
Daniel Borkmann Nov. 28, 2023, 1:30 p.m. UTC | #2
On 11/28/23 2:06 PM, Jesper Dangaard Brouer wrote:
> On 11/28/23 13:37, Jesper Dangaard Brouer wrote:
>> Hi Daniel,
>>
>> I'm trying to understand why skb_metadata_differs() needed to block GRO ?
>>
>> I was looking at XDP storing information in metadata area that also
>> survives into SKBs layer.  E.g. the RX timestamp.
>>
>> Then I noticed that GRO code (gro_list_prepare) will not allow
>> aggregating if metadata isn't the same in all packets via
>> skb_metadata_differs().  Is this really needed?
>> Can we lift/remove this limitation?
> 
> (Answering myself)
> I understand/see now that when an SKB gets GRO-aggregated, I will
> "lose" access to the metadata information and only have access to the
> metadata in the "first" SKB.
> Thus, the GRO layer still needs this check, as it cannot know whether
> the info was important or not.

^ Exactly this, in order to avoid losing information for the upper stack. I'm
not sure if there is an alternative scheme where the BPF prog can tell us
'it's okay to lose the metadata if the skb gets aggregated', and then we just
skip the skb_metadata_differs() check below. We could probably encode a flag
in the meta_len, given the latter requires 4-byte alignment. Then the BPF prog
can decide.
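
Rough sketch of what I have in mind; the flag name is made up, nothing
like this exists today:

/* meta_len must be a multiple of 4, so a low bit is free to carry an
 * "okay to drop this metadata on aggregation" flag.
 */
#define SKB_META_F_GRO_OK	0x1

static inline bool skb_metadata_gro_ok(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->meta_len & SKB_META_F_GRO_OK;
}

(skb_metadata_len()/skb_metadata_differs() would then have to mask the
flag bit out before comparing the actual metadata bytes.)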

> I wonder if there is a BPF hook, prior to GRO step, that could allow me
> to extract variable metadata and zero it out before GRO step.
> 
>> E.g. if I want to store a timestamp, then it will differ per packet.
>>
>> --Jesper
>>
>> Git history says it dates back to the original commit that added meta
>> pointer de8f3a83b0a0 ("bpf: add meta pointer for direct access") (author
>> Daniel).
>>
>>
>> diff --git a/net/core/gro.c b/net/core/gro.c
>> index 0759277dc14e..7fb6a6a24288 100644
>> --- a/net/core/gro.c
>> +++ b/net/core/gro.c
>> @@ -341,7 +341,7 @@ static void gro_list_prepare(const struct list_head *head,
>>
>>                  diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
>>                  diffs |= p->vlan_all ^ skb->vlan_all;
>> -               diffs |= skb_metadata_differs(p, skb);
>> +               diffs |= skb_metadata_differs(p, skb); // Why?
>>                  if (maclen == ETH_HLEN)
>>                          diffs |= compare_ether_header(skb_mac_header(p),
>
Toke Høiland-Jørgensen Nov. 28, 2023, 2:39 p.m. UTC | #3
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 11/28/23 2:06 PM, Jesper Dangaard Brouer wrote:
>> On 11/28/23 13:37, Jesper Dangaard Brouer wrote:
>>> Hi Daniel,
>>>
>>> I'm trying to understand why skb_metadata_differs() needed to block GRO ?
>>>
>>> I was looking at XDP storing information in metadata area that also
>>> survives into SKBs layer.  E.g. the RX timestamp.
>>>
>>> Then I noticed that GRO code (gro_list_prepare) will not allow
>>> aggregating if metadata isn't the same in all packets via
>>> skb_metadata_differs().  Is this really needed?
>>> Can we lift/remove this limitation?
>> 
>> (Answering myself)
>> I understand/see now that when an SKB gets GRO-aggregated, I will
>> "lose" access to the metadata information and only have access to the
>> metadata in the "first" SKB.
>> Thus, the GRO layer still needs this check, as it cannot know whether
>> the info was important or not.
>
> ^ Exactly this, in order to avoid losing information for the upper stack. I'm
> not sure if there is an alternative scheme where the BPF prog can tell us
> 'it's okay to lose the metadata if the skb gets aggregated', and then we just
> skip the skb_metadata_differs() check below. We could probably encode a flag
> in the meta_len, given the latter requires 4-byte alignment. Then the BPF
> prog can decide.

A flag seems sane. I guess we could encode some flag values in the upper
bits of the 'offset' argument of the bpf_xdp_adjust_meta() helper, since
valid values are guaranteed to be pretty small anyway? :)

I'm not quite sure what should be the semantics of that, though. I.e.,
if you are trying to aggregate two packets that have the flag set, which
packet do you take the value from? What if only one packet has the flag
set? Or should we instead have a "metadata_xdp_only" flag that just
prevents the skb metadata field from being set entirely? Or would both
be useful?

-Toke
Edward Cree Nov. 29, 2023, 6:04 p.m. UTC | #4
On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
> I'm not quite sure what should be the semantics of that, though. I.e.,
> if you are trying to aggregate two packets that have the flag set, which
> packet do you take the value from? What if only one packet has the flag
> set? Or should we instead have a "metadata_xdp_only" flag that just
> prevents the skb metadata field from being set entirely? Or would both
> be useful?

Sounds like what's actually needed is bpf progs inside the GRO engine
 to implement the metadata "protocol" prepare and coalesce callbacks?

-ed
Toke Høiland-Jørgensen Nov. 29, 2023, 9:52 p.m. UTC | #5
Edward Cree <ecree.xilinx@gmail.com> writes:

> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
>> I'm not quite sure what should be the semantics of that, though. I.e.,
>> if you are trying to aggregate two packets that have the flag set, which
>> packet do you take the value from? What if only one packet has the flag
>> set? Or should we instead have a "metadata_xdp_only" flag that just
>> prevents the skb metadata field from being set entirely? Or would both
>> be useful?
>
> Sounds like what's actually needed is bpf progs inside the GRO engine
>  to implement the metadata "protocol" prepare and coalesce callbacks?

Hmm, yes, I guess that would be the most general solution :)

-Toke
Daniel Borkmann Nov. 29, 2023, 11:10 p.m. UTC | #6
On 11/29/23 10:52 PM, Toke Høiland-Jørgensen wrote:
> Edward Cree <ecree.xilinx@gmail.com> writes:
>> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
>>> I'm not quite sure what should be the semantics of that, though. I.e.,
>>> if you are trying to aggregate two packets that have the flag set, which
>>> packet do you take the value from? What if only one packet has the flag

It would probably make sense if both packets have it set.

>>> set? Or should we instead have a "metadata_xdp_only" flag that just
>>> prevents the skb metadata field from being set entirely?

What would be the use case compared to resetting meta data right before
we return with XDP_PASS?
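
I.e. the prog can already do something like this today (sketch):

	/* just before returning XDP_PASS: shrink the metadata area back
	 * to zero so nothing is carried over into the skb
	 */
	int meta_len = ctx->data - ctx->data_meta;

	if (meta_len > 0)
		bpf_xdp_adjust_meta(ctx, meta_len);
	return XDP_PASS;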

>> Sounds like what's actually needed is bpf progs inside the GRO engine
>>   to implement the metadata "protocol" prepare and coalesce callbacks?
> 
> Hmm, yes, I guess that would be the most general solution :)

Feels like a potential good fit, agree, although for just solving the
above sth not requiring extra BPF might be nice as well.

Thanks,
Daniel
Toke Høiland-Jørgensen Nov. 30, 2023, 1:55 p.m. UTC | #7
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 11/29/23 10:52 PM, Toke Høiland-Jørgensen wrote:
>> Edward Cree <ecree.xilinx@gmail.com> writes:
>>> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
>>>> I'm not quite sure what should be the semantics of that, though. I.e.,
>>>> if you are trying to aggregate two packets that have the flag set, which
>>>> packet do you take the value from? What if only one packet has the flag
>
> It would probably make sense if both packets have it set.

Right, so "aggregate only if both packets have the flag set, keeping the
metadata area from the first packet", then?

>>>> set? Or should we instead have a "metadata_xdp_only" flag that just
>>>> prevents the skb metadata field from being set entirely?
>
> What would be the use case compared to resetting meta data right before
> we return with XDP_PASS?

I was thinking it could save a call to xdp_adjust_meta() to reset it
back to zero before PASSing the packet. But okay, that may be of
marginal utility.

>>> Sounds like what's actually needed is bpf progs inside the GRO engine
>>>   to implement the metadata "protocol" prepare and coalesce callbacks?
>> 
>> Hmm, yes, I guess that would be the most general solution :)
>
> Feels like a potential good fit, agree, although for just solving the
> above sth not requiring extra BPF might be nice as well.

Yeah, I agree that just the flag makes sense on its own.

-Toke
Daniel Borkmann Nov. 30, 2023, 4:32 p.m. UTC | #8
On 11/30/23 2:55 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes
>> On 11/29/23 10:52 PM, Toke Høiland-Jørgensen wrote:
>>> Edward Cree <ecree.xilinx@gmail.com> writes:
>>>> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
>>>>> I'm not quite sure what should be the semantics of that, though. I.e.,
>>>>> if you are trying to aggregate two packets that have the flag set, which
>>>>> packet do you take the value from? What if only one packet has the flag
>>
>> It would probably make sense if both packets have it set.
> 
> Right, so "aggregate only if both packets have the flag set, keeping the
> metadata area from the first packet", then?

Yes, sgtm.

>>>>> set? Or should we instead have a "metadata_xdp_only" flag that just
>>>>> prevents the skb metadata field from being set entirely?
>>
>> What would be the use case compared to resetting meta data right before
>> we return with XDP_PASS?
> 
> I was thinking it could save a call to xdp_adjust_meta() to reset it
> back to zero before PASSing the packet. But okay, that may be of
> marginal utility.

Agree, feels too marginal.

>>>> Sounds like what's actually needed is bpf progs inside the GRO engine
>>>>    to implement the metadata "protocol" prepare and coalesce callbacks?
>>>
>>> Hmm, yes, I guess that would be the most general solution :)
>>
>> Feels like a potential good fit, agree, although for just solving the
>> above sth not requiring extra BPF might be nice as well.
> 
> Yeah, I agree that just the flag makes sense on its own.
> 
> -Toke
>
Jesper Dangaard Brouer Nov. 30, 2023, 8:35 p.m. UTC | #9
On 11/30/23 17:32, Daniel Borkmann wrote:
> On 11/30/23 2:55 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes
>>> On 11/29/23 10:52 PM, Toke Høiland-Jørgensen wrote:
>>>> Edward Cree <ecree.xilinx@gmail.com> writes:
>>>>> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
>>>>>> I'm not quite sure what should be the semantics of that, though. 
>>>>>> I.e.,
>>>>>> if you are trying to aggregate two packets that have the flag set, 
>>>>>> which
>>>>>> packet do you take the value from? What if only one packet has the 
>>>>>> flag
>>>
>>> It would probably make sense if both packets have it set.
>>
>> Right, so "aggregate only if both packets have the flag set, keeping the
>> metadata area from the first packet", then?
> 
> Yes, sgtm.
> 

Seems like a good default behavior: "keeping the metadata area from the
first packet".
(Please object if someone sees an issue for their use-case with this
default.)
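
Something like the below in gro_list_prepare() could implement that
default, reusing a hypothetical "GRO ok" flag helper along the lines of
Daniel's meta_len idea (sketch only, not a tested patch):

		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
		diffs |= p->vlan_all ^ skb->vlan_all;
		/* both packets opted in: skip the metadata compare, the
		 * head skb (p) keeps its metadata area
		 */
		if (!(skb_metadata_gro_ok(p) && skb_metadata_gro_ok(skb)))
			diffs |= skb_metadata_differs(p, skb);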


>>>>>> set? Or should we instead have a "metadata_xdp_only" flag that just
>>>>>> prevents the skb metadata field from being set entirely?
>>>
>>> What would be the use case compared to resetting meta data right before
>>> we return with XDP_PASS?
>>
>> I was thinking it could save a call to xdp_adjust_meta() to reset it
>> back to zero before PASSing the packet. But okay, that may be of
>> marginal utility.
> 
> Agree, feels too marginal.
>

I should explain our use-case(s) a bit more.
We do want the information to survive XDP_PASS into the SKB.
It's the whole point, as we want to transfer information from the XDP
layer to the TC layer, and perhaps further, all the way to BPF socket
filters (I have even heard someone ask for that).

I'm trying to get an overview, as I now have multiple product teams that
want to store information across/into different layers, and there are
other teams that consume this information.

We are exploring more options than only the XDP metadata area to store
information.  I have suggested that once an SKB has a socket associated,
we can switch to using BPF socket local storage tricks.  (The lifetime
of XDP metadata is not 100% clear, as e.g. pskb_expand_head clears it
via skb_metadata_clear.)
All ideas are welcome; e.g. I'm also looking at the ability to store
auxiliary/metadata data associated with a dst_entry.  SKB->mark is
already used for other use-cases and isn't big enough (and then there
is the fun of crossing a netns boundary).

Let me explain *one* of the concrete use-cases.  As described in [1],
the CF XDP L4 load-balancer Unimog has been extended into a product
called Plurimog that does load-balancing across data centers ("colos").
When Plurimog redirects to another colo, the original "landing" colo's
ID is carried across (in some encap header) to a Unimog instance.  Thus,
the original landing colo ID is known to the Unimog running in another
colo, but that header is popped, so this info needs to be transferred
somehow.  I'm told that even the webserver/Nginx needs to know the
original/foreign landing colo ID (here there should be a socket
associated).  For TCP SYN packets, the layered DoS protection also needs
to know the foreign landing colo ID.  Other teams/products need this for
accounting, e.g. Traffic Manager [1], Radar [2] and capacity planning.


  [1] https://blog.cloudflare.com/meet-traffic-manager/
  [2] https://radar.cloudflare.com/



>>>>> Sounds like what's actually needed is bpf progs inside the GRO engine
>>>>>    to implement the metadata "protocol" prepare and coalesce 
>>>>> callbacks?
>>>>
>>>> Hmm, yes, I guess that would be the most general solution :)
>>>
>>> Feels like a potential good fit, agree, although for just solving the
>>> above sth not requiring extra BPF might be nice as well.
>>
>> Yeah, I agree that just the flag makes sense on its own.

I've mentioned before (e.g. at NetConf) that I would really like to see
BPF progs inside the GRO engine, but that is a larger project on its
own.  I think it is worth doing eventually, but I likely need a solution
to unblock the "tracing"/debugging use-case, where someone added a
timestamp to the XDP metadata and discovered GRO stopped working.

I guess we can do the Plurimog use-case now, as the colo ID should be
stable for packets belonging to the same (GRO) flow.

--Jesper
Martin KaFai Lau Nov. 30, 2023, 10 p.m. UTC | #10
On 11/30/23 12:35 PM, Jesper Dangaard Brouer wrote:
> I should explain our use-case(s) a bit more.
> We do want the information to survive XDP_PASS into the SKB.
> It's the whole point, as we want to transfer information from the XDP
> layer to the TC layer, and perhaps further, all the way to BPF socket
> filters (I have even heard someone ask for that).
> 
> I'm trying to get an overview, as I now have multiple product teams that
> want to store information across/into different layers, and there are
> other teams that consume this information.
> 
> We are exploring more options than only the XDP metadata area to store
> information.  I have suggested that once an SKB has a socket associated,
> we can switch to using BPF socket local storage tricks.  (The lifetime
> of XDP metadata is not 100% clear, as e.g. pskb_expand_head clears it
> via skb_metadata_clear.)
> All ideas are welcome; e.g. I'm also looking at the ability to store
> auxiliary/metadata data associated with a dst_entry.  SKB->mark is
> already used for other use-cases and isn't big enough (and then there
> is the fun of crossing a netns boundary).
> 
> Let me explain *one* of the concrete use-cases.  As described in [1],
> the CF XDP L4 load-balancer Unimog has been extended into a product
> called Plurimog that does load-balancing across data centers ("colos").
> When Plurimog redirects to another colo, the original "landing" colo's
> ID is carried across (in some encap header) to a Unimog instance.  Thus,
> the original landing colo ID is known to the Unimog running in another
> colo, but that header is popped, so this info needs to be transferred
> somehow.  I'm told that even the webserver/Nginx needs to know the
> original/foreign landing colo ID (here there should be a socket
> associated).  For TCP SYN packets, the layered DoS protection also needs
> to know the foreign landing colo ID.  Other teams/products need this for
> accounting, e.g. Traffic Manager [1], Radar [2] and capacity planning.

We also bumped into a use-case about plumbing the RX timestamp taken at XDP to
its final "sk" for analysis purposes. The use-case has not materialized.

fwiw, one of my thoughts at that time was similar to your sk local storage
thinking: do a bpf_sk_lookup_tcp() at XDP and store the stats there. It would
waste the lookup effort, because there is no skb to do the bpf_sk_assign() on.
Following this direction of thought, the next step is to allocate an skb in
the XDP prog itself if we know it is an XDP_PASS.

That said, the sk storage approach would work fine if whatever it wants to
collect from xdp_md(s)/skb(s) can be stacked/aggregated in a sk. It would be
nicer if __sk_buff->data_meta could work more like other BPF local storage
(sk, task, cgroup, etc.) such that it would be available to other BPF prog
types (e.g. a tracing prog).
Yan Zhai Dec. 1, 2023, 6:20 a.m. UTC | #11
On Thu, Nov 30, 2023 at 2:35 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> We are exploring more options than only the XDP metadata area to store
> information.  I have suggested that once an SKB has a socket associated,
> we can switch to using BPF socket local storage tricks.  (The lifetime
> of XDP metadata is not 100% clear, as e.g. pskb_expand_head clears it
> via skb_metadata_clear.)
> All ideas are welcome; e.g. I'm also looking at the ability to store
> auxiliary/metadata data associated with a dst_entry.  SKB->mark is
> already used for other use-cases and isn't big enough (and then there
> is the fun of crossing a netns boundary).
>
sk local storage might not work for cases where packets are purely
forwarded or end up at a tun/tap device. Can we make the XDP metadata
lifetime clear, then? It would also be really interesting if we could
sendmsg() with metadata, too. We often have a hard time determining
whether a kernel event like a packet drop/retransmission correlates
with a reported customer problem. It's hard because, when the event
happens, there is no customer-specific information in the context to
correlate with, usually only multiplexed sockets and the packet
triggering the event. Allowing some extra information to be carried on
the packet would definitely improve this a lot with BPF tracing.

Yan
Yan Zhai Dec. 1, 2023, 5:09 p.m. UTC | #12
On Thu, Nov 30, 2023 at 10:32 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/30/23 2:55 PM, Toke Høiland-Jørgensen wrote:
> > Daniel Borkmann <daniel@iogearbox.net> writes
> >> On 11/29/23 10:52 PM, Toke Høiland-Jørgensen wrote:
> >>> Edward Cree <ecree.xilinx@gmail.com> writes:
> >>>> On 28/11/2023 14:39, Toke Høiland-Jørgensen wrote:
> >>>>> I'm not quite sure what should be the semantics of that, though. I.e.,
> >>>>> if you are trying to aggregate two packets that have the flag set, which
> >>>>> packet do you take the value from? What if only one packet has the flag
> >>
> >> It would probably make sense if both packets have it set.
> >
> > Right, so "aggregate only if both packets have the flag set, keeping the
> > metadata area from the first packet", then?
>
> Yes, sgtm.
>

There is one flaw for TCP in the current implementation (before adding
the flag), which we experienced earlier in production: when metadata
differs on TCP packets, it not only disables GRO, it also reorders all
PSH packets. This happens because, when metadata differs, the new
packet is linked as a different node on the GRO merge list, since
NAPI_GRO_CB->same_flow is set to 0 for all previous packets. However,
packets with flags like PSH are immediately flushed to the upper
stack, while their predecessor packets might still be waiting on the
merge list. I think it might make sense to first delay the metadata
comparison until skb_gro_receive(), then add the flag to determine
whether the difference really matters.

Yan

Patch

diff --git a/net/core/gro.c b/net/core/gro.c
index 0759277dc14e..7fb6a6a24288 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -341,7 +341,7 @@ static void gro_list_prepare(const struct list_head *head,

                 diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
                 diffs |= p->vlan_all ^ skb->vlan_all;
-               diffs |= skb_metadata_differs(p, skb);
+               diffs |= skb_metadata_differs(p, skb); // Why?
                 if (maclen == ETH_HLEN)
                         diffs |= compare_ether_header(skb_mac_header(p),