
Question: Move BPF_SK_LOOKUP ahead of connected UDP sk lookup?

Message ID 6e239bb7-b7f9-4a40-bd1d-a522d4b9529c@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Question: Move BPF_SK_LOOKUP ahead of connected UDP sk lookup?

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Philo Lu Aug. 20, 2024, 12:31 p.m. UTC
Hi all, I wonder if it is feasible to move BPF_SK_LOOKUP ahead of 
connected UDP sk lookup?

That is something like:
(i.e., move connected udp socket lookup behind bpf sk lookup prog)
```
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ddb86baaea6c8..9a1408775bcb1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -493,13 +493,6 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
        slot2 = hash2 & udptable->mask;
        hslot2 = &udptable->hash2[slot2];

-       /* Lookup connected or non-wildcard socket */
-       result = udp4_lib_lookup2(net, saddr, sport,
-                                 daddr, hnum, dif, sdif,
-                                 hslot2, skb);
-       if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
-               goto done;
-
        /* Lookup redirect from BPF */
        if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
            udptable == net->ipv4.udp_table) {
@@ -512,6 +505,13 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
                }
        }

+       /* Lookup connected or non-wildcard socket */
+       result = udp4_lib_lookup2(net, saddr, sport,
+                                 daddr, hnum, dif, sdif,
+                                 hslot2, skb);
+       if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+               goto done;
+
        /* Got non-wildcard socket or error on first lookup */
        if (result)
                goto done;
```

This would be useful, e.g., when there are many concurrent UDP sockets sharing
the same ip:port, where udp4_lib_lookup2() can cause high softirq overhead
because it computes a score for every socket bound to that ip:port. With a bpf
sk_lookup prog, we could implement a 4-tuple hash for UDP socket lookup to
solve the problem (provided the bpf prog runs before udp4_lib_lookup2()).

Currently, in UDP, the bpf sk_lookup prog runs after the connected socket
lookup. IIUC, this is because the early version of SK_LOOKUP [0] redirected to
a socket by rewriting local_ip/local_port. That could interact badly with UDP
lookup, because UDP selects a socket by score, so setting local_ip/local_port
cannot guarantee which socket ends up selected. Now, however, the bpf
sk_lookup prog returns a socket directly from a map, so the above problem no
longer exists.

Is there any other problem with this approach? If not, I'll try to work on it
and submit patches later.

[0]https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/

Thank you for your time.

Comments

Jakub Sitnicki Aug. 21, 2024, 9:23 a.m. UTC | #1
Hi Philo,

[CC Eric and Paolo who have more context than me here.]

On Tue, Aug 20, 2024 at 08:31 PM +08, Philo Lu wrote:
> Hi all, I wonder if it is feasible to move BPF_SK_LOOKUP ahead of connected UDP
> sk lookup?
>
> That is something like:
> (i.e., move connected udp socket lookup behind bpf sk lookup prog)
> ```
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ddb86baaea6c8..9a1408775bcb1 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -493,13 +493,6 @@ struct sock *__udp4_lib_lookup(const struct net *net,
> __be32 saddr,
>         slot2 = hash2 & udptable->mask;
>         hslot2 = &udptable->hash2[slot2];
>
> -       /* Lookup connected or non-wildcard socket */
> -       result = udp4_lib_lookup2(net, saddr, sport,
> -                                 daddr, hnum, dif, sdif,
> -                                 hslot2, skb);
> -       if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
> -               goto done;
> -
>         /* Lookup redirect from BPF */
>         if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
>             udptable == net->ipv4.udp_table) {
> @@ -512,6 +505,13 @@ struct sock *__udp4_lib_lookup(const struct net *net,
> __be32 saddr,
>                 }
>         }
>
> +       /* Lookup connected or non-wildcard socket */
> +       result = udp4_lib_lookup2(net, saddr, sport,
> +                                 daddr, hnum, dif, sdif,
> +                                 hslot2, skb);
> +       if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
> +               goto done;
> +
>         /* Got non-wildcard socket or error on first lookup */
>         if (result)
>                 goto done;
> ```
>
> This would be useful, e.g., when there are many concurrent UDP sockets sharing
> the same ip:port, where udp4_lib_lookup2() can cause high softirq overhead
> because it computes a score for every socket bound to that ip:port. With a bpf
> sk_lookup prog, we could implement a 4-tuple hash for UDP socket lookup to
> solve the problem (provided the bpf prog runs before udp4_lib_lookup2()).
>
> Currently, in UDP, the bpf sk_lookup prog runs after the connected socket
> lookup. IIUC, this is because the early version of SK_LOOKUP [0] redirected to
> a socket by rewriting local_ip/local_port. That could interact badly with UDP
> lookup, because UDP selects a socket by score, so setting local_ip/local_port
> cannot guarantee which socket ends up selected. Now, however, the bpf sk_lookup
> prog returns a socket directly from a map, so the above problem no longer exists.
>
> Is there any other problem with this approach? If not, I'll try to work on it
> and submit patches later.
>
> [0]https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
>
> Thank you for your time.

It was done like that to maintain the connected UDP socket guarantees.
Similarly to the established TCP sockets. The contract is that if you
are bound to a 4-tuple, you will receive the packets destined to it.

It sounds like you are looking for an efficient way to lookup a
connected UDP socket. We would be interested in that as well. We use
connected UDP/QUIC on egress where we don't expect the peer to roam and
change its address. There's a memory cost on the kernel side to using
them, but they make it easier to structure your application, because you
can have roughly the same design for TCP and UDP transport.

So what if instead of doing it in BPF, we make it better for everyone
and introduce a hash table keyed by 4-tuple for connected sockets in the
udp stack itself (counterpart of ehash in tcp)?
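
Very roughly, and purely as a sketch - the table below (udp_ehash,
udp_ehash_mask, struct udp_ehash_bucket) doesn't exist today, although
udp_ehashfn() and the match helpers do - a 4-tuple lookup tried before the
scored hash2 walk could look like:

```
/* Hypothetical ehash-style lookup for connected UDP sockets, tried before
 * the existing score-based udp4_lib_lookup2() walk.
 */
static struct sock *udp4_ehash_lookup(struct net *net,
				      __be32 saddr, __be16 sport,
				      __be32 daddr, __be16 dport,
				      int dif, int sdif)
{
	u32 hash = udp_ehashfn(net, daddr, ntohs(dport), saddr, sport);
	struct udp_ehash_bucket *head = &udp_ehash[hash & udp_ehash_mask];
	struct sock *sk;

	sk_for_each_rcu(sk, &head->chain) {
		const struct inet_sock *inet = inet_sk(sk);

		if (sk->sk_state == TCP_ESTABLISHED &&
		    net_eq(sock_net(sk), net) &&
		    inet->inet_daddr == saddr && inet->inet_dport == sport &&
		    inet->inet_rcv_saddr == daddr &&
		    inet->inet_num == ntohs(dport) &&
		    udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
			return sk;	/* exact 4-tuple match, no scoring */
	}
	return NULL;
}
```

Sockets would get inserted into that table on connect() and removed on
disconnect/close, the same way TCP maintains ehash.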

Thanks,
(the other) Jakub
Philo Lu Aug. 21, 2024, 11:44 a.m. UTC | #2
Hi Jakub,

On 2024/8/21 17:23, Jakub Sitnicki wrote:
> Hi Philo,
> 
> [CC Eric and Paolo who have more context than me here.]
> 
> On Tue, Aug 20, 2024 at 08:31 PM +08, Philo Lu wrote:
>> Hi all, I wonder if it is feasible to move BPF_SK_LOOKUP ahead of connected UDP
>> sk lookup?
>>
...
>>
>> Is there any other problem with this approach? If not, I'll try to work on it
>> and submit patches later.
>>
>> [0]https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
>>
>> Thank you for your time.
> 
> It was done like that to maintain the connected UDP socket guarantees.
> Similarly to the established TCP sockets. The contract is that if you
> are bound to a 4-tuple, you will receive the packets destined to it.
> 

Thanks for your explanation. IIUC, bpf_sk_lookup was designed to skip the
connected socket lookup (established for TCP, connected for UDP), so it is
not supposed to run before the connected UDP lookup.
(Though it seems so close to solving our problem...)

> It sounds like you are looking for an efficient way to lookup a
> connected UDP socket. We would be interested in that as well. We use
> connected UDP/QUIC on egress where we don't expect the peer to roam and
> change its address. There's a memory cost on the kernel side to using
> them, but they make it easier to structure your application, because you
> can have roughly the same design for TCP and UDP transport.
> 
Yes, we have exactly the same problem.

> So what if instead of doing it in BPF, we make it better for everyone
> and introduce a hash table keyed by 4-tuple for connected sockets in the
> udp stack itself (counterpart of ehash in tcp)?

This solution also works for me, but I'm not sure whether there have been
previous attempts or whether there are known technical problems with it.

In fact, I have done a simple test with a 4-tuple UDP lookup, and it does
make a difference:
(kernel 5.10, 1000 connected UDP sockets on the server, sockperf sending
messages to one of them, results averaged over 5s)

Without 4-tuple lookup:

%Cpu0: 0.0 us, 0.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 100.0 si, 0.0 st
%Cpu1: 0.2 us, 0.2 sy, 0.0 ni, 99.4 id, 0.0 wa, 0.2 hi,   0.0 si, 0.0 st
MiB Mem :7625.1 total,   6761.5 free,    210.2 used,    653.4 buff/cache
MiB Swap:   0.0 total,      0.0 free,      0.0 used.   7176.2 avail Mem

---
With 4-tuple lookup:

%Cpu0: 0.2 us, 0.4 sy, 0.0 ni, 48.1 id, 0.0 wa, 1.2 hi, 50.1 si,  0.0 st
%Cpu1: 0.6 us, 0.4 sy, 0.0 ni, 98.8 id, 0.0 wa, 0.2 hi,  0.0 si,  0.0 st
MiB Mem :7625.1 total,   6759.9 free,    211.9 used,    653.3 buff/cache
MiB Swap:   0.0 total,      0.0 free,      0.0 used.   7174.6 avail Mem

> 
> Thanks,
> (the other) Jakub

Thanks.
Jakub Sitnicki Aug. 22, 2024, 6:29 p.m. UTC | #3
On Wed, Aug 21, 2024 at 07:44 PM +08, Philo Lu wrote:
> On 2024/8/21 17:23, Jakub Sitnicki wrote:
>> Hi Philo,
>> [CC Eric and Paolo who have more context than me here.]
>> On Tue, Aug 20, 2024 at 08:31 PM +08, Philo Lu wrote:
>>> Hi all, I wonder if it is feasible to move BPF_SK_LOOKUP ahead of connected UDP
>>> sk lookup?
>>>
> ...
>>>
>>> Is there any other problem with this approach? If not, I'll try to work on
>>> it and submit patches later.
>>>
>>> [0]https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
>>>
>>> Thank you for your time.
>> It was done like that to maintain the connected UDP socket guarantees.
>> Similarly to the established TCP sockets. The contract is that if you
>> are bound to a 4-tuple, you will receive the packets destined to it.
>> 
>
> Thanks for your explanation. IIUC, bpf_sk_lookup was designed to skip the
> connected socket lookup (established for TCP, connected for UDP), so it is not
> supposed to run before the connected UDP lookup.
> (Though it seems so close to solving our problem...)

Yes, correct. Motivation behind bpf_sk_lookup was to steer TCP
connections & UDP flows to listening / unconnected sockets, like you can
do with TPROXY [1].

Since it had nothing to do with established / connected sockets, we
added the BPF hook in such a way that they are unaffected by it.

>> It sounds like you are looking for an efficient way to lookup a
>> connected UDP socket. We would be interested in that as well. We use
>> connected UDP/QUIC on egress where we don't expect the peer to roam and
>> change its address. There's a memory cost on the kernel side to using
>> them, but they make it easier to structure your application, because you
>> can have roughly the same design for TCP and UDP transport.
>> 
> Yes, we have exactly the same problem.

Good to know that there are other users of connected UDP out there.

Loosely related - I'm planning to raise the question of whether using
connected UDP sockets on ingress makes sense for QUIC at Plumbers [2].
Connected UDP lookup performance is one of the aspects here.

>> So what if instead of doing it in BPF, we make it better for everyone
>> and introduce a hash table keyed by 4-tuple for connected sockets in the
>> udp stack itself (counterpart of ehash in tcp)?
>
> This solution also works for me, but I'm not sure whether there have been
> previous attempts or whether there are known technical problems with it.
>
> In fact, I have done a simple test with a 4-tuple UDP lookup, and it does make
> a difference:
> (kernel 5.10, 1000 connected UDP sockets on the server, sockperf sending
> messages to one of them, results averaged over 5s)
>
> Without 4-tuple lookup:
>
> %Cpu0: 0.0 us, 0.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 100.0 si, 0.0 st
> %Cpu1: 0.2 us, 0.2 sy, 0.0 ni, 99.4 id, 0.0 wa, 0.2 hi,   0.0 si, 0.0 st
> MiB Mem :7625.1 total,   6761.5 free,    210.2 used,    653.4 buff/cache
> MiB Swap:   0.0 total,      0.0 free,      0.0 used.   7176.2 avail Mem
>
> ---
> With 4-tuple lookup:
>
> %Cpu0: 0.2 us, 0.4 sy, 0.0 ni, 48.1 id, 0.0 wa, 1.2 hi, 50.1 si,  0.0 st
> %Cpu1: 0.6 us, 0.4 sy, 0.0 ni, 98.8 id, 0.0 wa, 0.2 hi,  0.0 si,  0.0 st
> MiB Mem :7625.1 total,   6759.9 free,    211.9 used,    653.3 buff/cache
> MiB Swap:   0.0 total,      0.0 free,      0.0 used.   7174.6 avail Mem

Right. The overhead is expected. All of the server's connected sockets end up
in one hash bucket, and we need to walk a long chain on lookup.

The workaround is not "pretty". You have to configure your server to
receive on multiple IP addresses and/or ports :-/

[1] Which also respects established / connected sockets, as long as they
    have the IP_TRANSPARENT flag set.  Users need to set it "manually" for UDP.

[2] https://lpc.events/event/18/abstracts/2134/

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ddb86baaea6c8..9a1408775bcb1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -493,13 +493,6 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
         slot2 = hash2 & udptable->mask;
         hslot2 = &udptable->hash2[slot2];

-       /* Lookup connected or non-wildcard socket */
-       result = udp4_lib_lookup2(net, saddr, sport,
-                                 daddr, hnum, dif, sdif,
-                                 hslot2, skb);
-       if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
-               goto done;
-
         /* Lookup redirect from BPF */
         if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&