[bpf,v2,0/4] Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings.

Message ID	20230420145041.508434-1-gilad9366@gmail.com (mailing list archive)
Headers	show Return-Path: <bpf-owner@vger.kernel.org> From: Gilad Sever <gilad9366@gmail.com> To: dsahern@kernel.org, martin.lau@linux.dev, daniel@iogearbox.net, john.fastabend@gmail.com, ast@kernel.org, andrii@kernel.org, song@kernel.org, yhs@fb.com, kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, mykolal@fb.com, shuah@kernel.org, hawk@kernel.org, joe@wand.net.nz Cc: eyal.birger@gmail.com, shmulik.ladkani@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org, linux-kselftest@vger.kernel.org, Gilad Sever <gilad9366@gmail.com> Subject: [PATCH bpf,v2 0/4] Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings. Date: Thu, 20 Apr 2023 17:50:37 +0300 Message-Id: <20230420145041.508434-1-gilad9366@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings. \| expand [bpf,v2,0/4] Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings. [bpf,v2,1/4] bpf: factor out socket lookup functions for the TC hookpoint. [bpf,v2,2/4] bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint [bpf,v2,3/4] bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings [bpf,v2,4/4] selftests/bpf: Add tc_socket_lookup tests

Message ID

20230420145041.508434-1-gilad9366@gmail.com (mailing list archive)

Headers

From: Gilad Sever <gilad9366@gmail.com>
To: dsahern@kernel.org, martin.lau@linux.dev, daniel@iogearbox.net,
        john.fastabend@gmail.com, ast@kernel.org, andrii@kernel.org,
        song@kernel.org, yhs@fb.com, kpsingh@kernel.org, sdf@google.com,
        haoluo@google.com, jolsa@kernel.org, davem@davemloft.net,
        edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
        mykolal@fb.com, shuah@kernel.org, hawk@kernel.org, joe@wand.net.nz
Cc: eyal.birger@gmail.com, shmulik.ladkani@gmail.com,
        bpf@vger.kernel.org, netdev@vger.kernel.org,
        linux-kselftest@vger.kernel.org, Gilad Sever <gilad9366@gmail.com>
Subject: [PATCH bpf,v2 0/4] Socket lookup BPF API from tc/xdp ingress does not
 respect VRF bindings.
Date: Thu, 20 Apr 2023 17:50:37 +0300
Message-Id: <20230420145041.508434-1-gilad9366@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings. | expand

Message

Gilad Sever April 20, 2023, 2:50 p.m. UTC

When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
respected. This patchset fixes this by regarding the incoming device's
VRF attachment when performing the socket lookups from tc/xdp.

The first two patches are coding changes which facilitate this fix by
factoring out the tc helper's logic which was shared with cg/sk_skb
(which operate correctly).

The third patch contains the actual bugfix.

The fourth patch adds bpf tests for these lookup functions.
---
v2: Fixed uninitialized var in test patch (4).

Gilad Sever (4):
  bpf: factor out socket lookup functions for the TC hookpoint.
  bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
    hookpoint
  bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
  selftests/bpf: Add tc_socket_lookup tests

 net/core/filter.c                             | 132 +++++--
 .../bpf/prog_tests/tc_socket_lookup.c         | 341 ++++++++++++++++++
 .../selftests/bpf/progs/tc_socket_lookup.c    |  73 ++++
 3 files changed, 525 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_socket_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/tc_socket_lookup.c

Comments

Stanislav Fomichev April 20, 2023, 4:42 p.m. UTC | #1

On 04/20, Gilad Sever wrote:
> When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
> respected. This patchset fixes this by regarding the incoming device's
> VRF attachment when performing the socket lookups from tc/xdp.
> 
> The first two patches are coding changes which facilitate this fix by
> factoring out the tc helper's logic which was shared with cg/sk_skb
> (which operate correctly).

Why is not relevant for cgroup/egress? Is it already running with
the correct device?

Also, do we really need all this refactoring and separate paths?
Can we just add that bpf_l2_sdif part to the existing code?
It will trigger for tc, but I'm assuming it will be a no-op for cgroup
path?

And regarding bpf_l2_sdif: seems like it's really generic and should
probably be called something like dev_sdif?

> The third patch contains the actual bugfix.
> 
> The fourth patch adds bpf tests for these lookup functions.
> ---
> v2: Fixed uninitialized var in test patch (4).
> 
> Gilad Sever (4):
>   bpf: factor out socket lookup functions for the TC hookpoint.
>   bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
>     hookpoint
>   bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
>   selftests/bpf: Add tc_socket_lookup tests
> 
>  net/core/filter.c                             | 132 +++++--
>  .../bpf/prog_tests/tc_socket_lookup.c         | 341 ++++++++++++++++++
>  .../selftests/bpf/progs/tc_socket_lookup.c    |  73 ++++
>  3 files changed, 525 insertions(+), 21 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_socket_lookup.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tc_socket_lookup.c
> 
> -- 
> 2.34.1
>

Gilad Sever April 23, 2023, 11:41 a.m. UTC | #2

On 20/04/2023 19:42, Stanislav Fomichev wrote:
> On 04/20, Gilad Sever wrote:
>> When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
>> respected. This patchset fixes this by regarding the incoming device's
>> VRF attachment when performing the socket lookups from tc/xdp.
>>
>> The first two patches are coding changes which facilitate this fix by
>> factoring out the tc helper's logic which was shared with cg/sk_skb
>> (which operate correctly).
> Why is not relevant for cgroup/egress? Is it already running with
> the correct device?
Yes.
>
> Also, do we really need all this refactoring and separate paths?
> Can we just add that bpf_l2_sdif part to the existing code?
> It will trigger for tc, but I'm assuming it will be a no-op for cgroup
> path?
The reason we preferred the refactoring is to avoid triggering `inet_sdif()`
from tc/xdp. This is because in our understanding, the IPCB is undefined 
before
IP processing so it seems incorrect to use `inet_sdif()` from tc/xdp.

We did come up with a different option which could spare most of the 
refactoring
and still partially separate the two paths:

Pass sdif to __bpf_skc_lookup() but instead of using different functions
for tc, calculate sdif by calling `dev_sdif()` in bpf_skc_lookup() only when
netif_is_l3_master() is false. In other words:

- xdp callers would check the device's l3 enslaved state using the new 
`dev_sdif()`
- sock_addr callers would use inet{,6}_sdif() as they did before
- cg/tc share the same code path, so when netif_is_l3_master() is true
   use inet{,6}_sdif() and when it is false use dev_sdif(). this relies 
on the following
   assumptions:
   - tc programs don't run on l3 master devices
   - cgroup callers never see l3 enslaved devices
   - inet{,6}_sdif() isn't relevant for non l3 master devices

In our opinion, it's safer to factor out the tc flow as in the patchset, 
similar to XDP
which has its own functions.

What do you think?
> And regarding bpf_l2_sdif: seems like it's really generic and should
> probably be called something like dev_sdif?
Agreed. I'll rename in the next patch.
>
>> The third patch contains the actual bugfix.
>>
>> The fourth patch adds bpf tests for these lookup functions.
>> ---
>> v2: Fixed uninitialized var in test patch (4).
>>
>> Gilad Sever (4):
>>    bpf: factor out socket lookup functions for the TC hookpoint.
>>    bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
>>      hookpoint
>>    bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
>>    selftests/bpf: Add tc_socket_lookup tests
>>
>>   net/core/filter.c                             | 132 +++++--
>>   .../bpf/prog_tests/tc_socket_lookup.c         | 341 ++++++++++++++++++
>>   .../selftests/bpf/progs/tc_socket_lookup.c    |  73 ++++
>>   3 files changed, 525 insertions(+), 21 deletions(-)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_socket_lookup.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/tc_socket_lookup.c
>>
>> -- 
>> 2.34.1
>>

Stanislav Fomichev April 24, 2023, 5:06 p.m. UTC | #3

On Sun, Apr 23, 2023 at 4:41 AM Gilad Sever <gilad9366@gmail.com> wrote:
>
>
> On 20/04/2023 19:42, Stanislav Fomichev wrote:
> > On 04/20, Gilad Sever wrote:
> >> When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
> >> respected. This patchset fixes this by regarding the incoming device's
> >> VRF attachment when performing the socket lookups from tc/xdp.
> >>
> >> The first two patches are coding changes which facilitate this fix by
> >> factoring out the tc helper's logic which was shared with cg/sk_skb
> >> (which operate correctly).
> > Why is not relevant for cgroup/egress? Is it already running with
> > the correct device?
> Yes.
> >
> > Also, do we really need all this refactoring and separate paths?
> > Can we just add that bpf_l2_sdif part to the existing code?
> > It will trigger for tc, but I'm assuming it will be a no-op for cgroup
> > path?
> The reason we preferred the refactoring is to avoid triggering `inet_sdif()`
> from tc/xdp. This is because in our understanding, the IPCB is undefined
> before
> IP processing so it seems incorrect to use `inet_sdif()` from tc/xdp.
>
> We did come up with a different option which could spare most of the
> refactoring
> and still partially separate the two paths:
>
> Pass sdif to __bpf_skc_lookup() but instead of using different functions
> for tc, calculate sdif by calling `dev_sdif()` in bpf_skc_lookup() only when
> netif_is_l3_master() is false. In other words:

[..]

> - xdp callers would check the device's l3 enslaved state using the new
> `dev_sdif()`
> - sock_addr callers would use inet{,6}_sdif() as they did before
> - cg/tc share the same code path, so when netif_is_l3_master() is true
>    use inet{,6}_sdif() and when it is false use dev_sdif(). this relies
> on the following
>    assumptions:
>    - tc programs don't run on l3 master devices
>    - cgroup callers never see l3 enslaved devices
>    - inet{,6}_sdif() isn't relevant for non l3 master devices

Yeah, that's what I was assuming we should be able to do..
But we probably need somebody who understands this part better than me
to say whether the above are safe..

If nobody comments, ignore me and do a v2 with your original approach.

David Ahern April 25, 2023, 1:39 a.m. UTC | #4

On 4/24/23 11:06 AM, Stanislav Fomichev wrote:
>> - xdp callers would check the device's l3 enslaved state using the new
>> `dev_sdif()`
>> - sock_addr callers would use inet{,6}_sdif() as they did before
>> - cg/tc share the same code path, so when netif_is_l3_master() is true
>>    use inet{,6}_sdif() and when it is false use dev_sdif(). this relies
>> on the following
>>    assumptions:
>>    - tc programs don't run on l3 master devices

this can happen, but I am not sure how prevalent a use case.

>>    - cgroup callers never see l3 enslaved devices

egress definitely, not sure on ingress. The code resets the skb->dev
back to the original device in a lot of places in the ip/ipv6 code now.
And ipv6 brings up LLAs and those did not get the device switch so it
could be fairly common.

>>    - inet{,6}_sdif() isn't relevant for non l3 master devices

sdif should be 0 and not matched if a netdev is not a l3mdev port.

BTW, in skimming the patches, I noticed patch 3 has bpf_l2_sdif which
seems an odd name to me. It returns a layer 3 device index, not a layer
2 which would be a bridge port. I would stick to the l3 naming for
consistency.

> 
> Yeah, that's what I was assuming we should be able to do..
> But we probably need somebody who understands this part better than me
> to say whether the above are safe..
> 
> If nobody comments, ignore me and do a v2 with your original approach.