diff mbox series

[bpf,v2,1/2] bpf: Derive source IP addr via bpf_*_fib_lookup()

Message ID 20231003071013.824623-2-m@lambda.lt (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: Fix src IP addr related limitation in bpf_*_fib_lookup() | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf, async
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3570 this patch: 3570
netdev/cc_maintainers warning 14 maintainers not CCed: jolsa@kernel.org pabeni@redhat.com haoluo@google.com davem@davemloft.net ast@kernel.org sdf@google.com edumazet@google.com dsahern@kernel.org yonghong.song@linux.dev andrii@kernel.org john.fastabend@gmail.com kpsingh@kernel.org song@kernel.org kuba@kernel.org
netdev/build_clang success Errors and warnings before: 1579 this patch: 1579
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3801 this patch: 3801
netdev/checkpatch warning CHECK: Please don't use multiple blank lines WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for test_maps on s390x with gcc

Commit Message

Martynas Pumputis Oct. 3, 2023, 7:10 a.m. UTC
Extend the bpf_fib_lookup() helper by making it to return the source
IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set.

For example, the following snippet can be used to derive the desired
source IP address:

    struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };

    ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
            BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
    if (ret != BPF_FIB_LKUP_RET_SUCCESS)
        return TC_ACT_SHOT;

    /* the p.ipv4_src now contains the source address */

The inability to derive the proper source address may cause malfunctions
in BPF-based dataplanes for hosts containing netdevs with more than one
routable IP address or for multi-homed hosts.

For example, Cilium implements packet masquerading in BPF. If an
egressing netdev to which the Cilium's BPF prog is attached has
multiple IP addresses, then only one [hardcoded] IP address can be used for
masquerading. This breaks connectivity if any other IP address should have
been selected instead, for example, when a public and private addresses
are attached to the same egress interface.

The change was tested with Cilium [1].

Nikolay Aleksandrov helped to figure out the IPv6 addr selection.

[1]: https://github.com/cilium/cilium/pull/28283

Signed-off-by: Martynas Pumputis <m@lambda.lt>
---
 include/net/ipv6_stubs.h       |  5 +++++
 include/uapi/linux/bpf.h       |  9 +++++++++
 net/core/filter.c              | 19 ++++++++++++++++++-
 net/ipv6/af_inet6.c            |  1 +
 tools/include/uapi/linux/bpf.h | 10 ++++++++++
 5 files changed, 43 insertions(+), 1 deletion(-)

Comments

Martin KaFai Lau Oct. 4, 2023, 8:09 p.m. UTC | #1
On 10/3/23 12:10 AM, Martynas Pumputis wrote:
> Extend the bpf_fib_lookup() helper by making it to return the source
> IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set.
> 
> For example, the following snippet can be used to derive the desired
> source IP address:
> 
>      struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };
> 
>      ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
>              BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
>      if (ret != BPF_FIB_LKUP_RET_SUCCESS)
>          return TC_ACT_SHOT;
> 
>      /* the p.ipv4_src now contains the source address */
> 
> The inability to derive the proper source address may cause malfunctions
> in BPF-based dataplanes for hosts containing netdevs with more than one
> routable IP address or for multi-homed hosts.
> 
> For example, Cilium implements packet masquerading in BPF. If an
> egressing netdev to which the Cilium's BPF prog is attached has
> multiple IP addresses, then only one [hardcoded] IP address can be used for
> masquerading. This breaks connectivity if any other IP address should have
> been selected instead, for example, when a public and private addresses
> are attached to the same egress interface.
> 
> The change was tested with Cilium [1].
> 
> Nikolay Aleksandrov helped to figure out the IPv6 addr selection.
> 
> [1]: https://github.com/cilium/cilium/pull/28283
> 
> Signed-off-by: Martynas Pumputis <m@lambda.lt>
> ---
>   include/net/ipv6_stubs.h       |  5 +++++
>   include/uapi/linux/bpf.h       |  9 +++++++++
>   net/core/filter.c              | 19 ++++++++++++++++++-
>   net/ipv6/af_inet6.c            |  1 +
>   tools/include/uapi/linux/bpf.h | 10 ++++++++++
>   5 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
> index c48186bf4737..21da31e1dff5 100644
> --- a/include/net/ipv6_stubs.h
> +++ b/include/net/ipv6_stubs.h
> @@ -85,6 +85,11 @@ struct ipv6_bpf_stub {
>   			       sockptr_t optval, unsigned int optlen);
>   	int (*ipv6_getsockopt)(struct sock *sk, int level, int optname,
>   			       sockptr_t optval, sockptr_t optlen);
> +	int (*ipv6_dev_get_saddr)(struct net *net,
> +				  const struct net_device *dst_dev,
> +				  const struct in6_addr *daddr,
> +				  unsigned int prefs,
> +				  struct in6_addr *saddr);
>   };
>   extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
>   
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0448700890f7..a6bf686eecbc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3257,6 +3257,10 @@ union bpf_attr {
>    *			and *params*->smac will not be set as output. A common
>    *			use case is to call **bpf_redirect_neigh**\ () after
>    *			doing **bpf_fib_lookup**\ ().
> + *		**BPF_FIB_LOOKUP_SET_SRC**
> + *			Derive and set source IP addr in *params*->ipv{4,6}_src
> + *			for the nexthop. If the src addr cannot be derived,
> + *			**BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned.
>    *
>    *		*ctx* is either **struct xdp_md** for XDP programs or
>    *		**struct sk_buff** tc cls_act programs.
> @@ -6953,6 +6957,7 @@ enum {
>   	BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),
>   	BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2),
>   	BPF_FIB_LOOKUP_TBID    = (1U << 3),
> +	BPF_FIB_LOOKUP_SET_SRC = (1U << 4),

bikeshedding: Shorten it to BPF_FIB_LOOKUP_SRC?

>   };
>   
>   enum {
> @@ -6965,6 +6970,7 @@ enum {
>   	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
>   	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
>   	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
>   };
>   
>   struct bpf_fib_lookup {
> @@ -6999,6 +7005,9 @@ struct bpf_fib_lookup {
>   		__u32	rt_metric;
>   	};
>   
> +	/* input: source address to consider for lookup
> +	 * output: source address result from lookup
> +	 */
>   	union {
>   		__be32		ipv4_src;
>   		__u32		ipv6_src[4];  /* in6_addr; network order */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a094694899c9..f3777ef1840b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5850,6 +5850,9 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   	params->rt_metric = res.fi->fib_priority;
>   	params->ifindex = dev->ifindex;
>   
> +	if (flags & BPF_FIB_LOOKUP_SET_SRC)
> +		params->ipv4_src = fib_result_prefsrc(net, &res);

Does it need to check 0 and return BPF_FIB_LKUP_RET_NO_SRC_ADDR for v4 also,
or the fib_result_prefsrc does not fail?

> +
>   	/* xdp and cls_bpf programs are run in RCU-bh so
>   	 * rcu_read_lock_bh is not needed here
>   	 */
> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   	params->rt_metric = res.f6i->fib6_metric;
>   	params->ifindex = dev->ifindex;
>   
> +	if (flags & BPF_FIB_LOOKUP_SET_SRC) {
> +		if (res.f6i->fib6_prefsrc.plen) {
> +			*(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr;
> +		} else {
> +			err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev,
> +								&fl6.daddr, 0,
> +								(struct in6_addr *)
> +								params->ipv6_src);
> +			if (err)
> +				return BPF_FIB_LKUP_RET_NO_SRC_ADDR;

This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of 
improving the API. May be others have some ideas.

Considering dev has no saddr is probably (?) an unlikely case, it should be ok 
to leave it as is but at least a comment in the uapi will be needed. Otherwise, 
the bpf prog may use the 0 dmac as-is.

I feel the current bpf_ipv[46]_fib_lookup helper is doing many things in one 
function and then requires different BPF_FIB_LOOKUP_* bits to select what/how to 
do. In the future, it may be worth to consider breaking it into smaller 
kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc.

> +		}
> +	}
> +
>   	if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH)
>   		goto set_fwd_params;
>   
> @@ -6010,7 +6026,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   #endif
>   
>   #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
> -			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID)
> +			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
> +			     BPF_FIB_LOOKUP_SET_SRC)
>   
>   BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
>   	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
Martynas Pumputis Oct. 5, 2023, 8:16 p.m. UTC | #2
Thanks for the review.

On Wed, Oct 4, 2023, at 10:09 PM, Martin KaFai Lau wrote:
> On 10/3/23 12:10 AM, Martynas Pumputis wrote:
>> Extend the bpf_fib_lookup() helper by making it to return the source
>> IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set.
>> 
>> For example, the following snippet can be used to derive the desired
>> source IP address:
>> 
>>      struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };
>> 
>>      ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
>>              BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
>>      if (ret != BPF_FIB_LKUP_RET_SUCCESS)
>>          return TC_ACT_SHOT;
>> 
>>      /* the p.ipv4_src now contains the source address */
>> 
>> The inability to derive the proper source address may cause malfunctions
>> in BPF-based dataplanes for hosts containing netdevs with more than one
>> routable IP address or for multi-homed hosts.
>> 
>> For example, Cilium implements packet masquerading in BPF. If an
>> egressing netdev to which the Cilium's BPF prog is attached has
>> multiple IP addresses, then only one [hardcoded] IP address can be used for
>> masquerading. This breaks connectivity if any other IP address should have
>> been selected instead, for example, when a public and private addresses
>> are attached to the same egress interface.
>> 
>> The change was tested with Cilium [1].
>> 
>> Nikolay Aleksandrov helped to figure out the IPv6 addr selection.
>> 
>> [1]: https://github.com/cilium/cilium/pull/28283
>> 
>> Signed-off-by: Martynas Pumputis <m@lambda.lt>
>> ---
>>   include/net/ipv6_stubs.h       |  5 +++++
>>   include/uapi/linux/bpf.h       |  9 +++++++++
>>   net/core/filter.c              | 19 ++++++++++++++++++-
>>   net/ipv6/af_inet6.c            |  1 +
>>   tools/include/uapi/linux/bpf.h | 10 ++++++++++
>>   5 files changed, 43 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
>> index c48186bf4737..21da31e1dff5 100644
>> --- a/include/net/ipv6_stubs.h
>> +++ b/include/net/ipv6_stubs.h
>> @@ -85,6 +85,11 @@ struct ipv6_bpf_stub {
>>   			       sockptr_t optval, unsigned int optlen);
>>   	int (*ipv6_getsockopt)(struct sock *sk, int level, int optname,
>>   			       sockptr_t optval, sockptr_t optlen);
>> +	int (*ipv6_dev_get_saddr)(struct net *net,
>> +				  const struct net_device *dst_dev,
>> +				  const struct in6_addr *daddr,
>> +				  unsigned int prefs,
>> +				  struct in6_addr *saddr);
>>   };
>>   extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
>>   
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0448700890f7..a6bf686eecbc 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -3257,6 +3257,10 @@ union bpf_attr {
>>    *			and *params*->smac will not be set as output. A common
>>    *			use case is to call **bpf_redirect_neigh**\ () after
>>    *			doing **bpf_fib_lookup**\ ().
>> + *		**BPF_FIB_LOOKUP_SET_SRC**
>> + *			Derive and set source IP addr in *params*->ipv{4,6}_src
>> + *			for the nexthop. If the src addr cannot be derived,
>> + *			**BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned.
>>    *
>>    *		*ctx* is either **struct xdp_md** for XDP programs or
>>    *		**struct sk_buff** tc cls_act programs.
>> @@ -6953,6 +6957,7 @@ enum {
>>   	BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),
>>   	BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2),
>>   	BPF_FIB_LOOKUP_TBID    = (1U << 3),
>> +	BPF_FIB_LOOKUP_SET_SRC = (1U << 4),
>
> bikeshedding: Shorten it to BPF_FIB_LOOKUP_SRC?

SGTM.

>
>>   };
>>   
>>   enum {
>> @@ -6965,6 +6970,7 @@ enum {
>>   	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
>>   	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
>>   	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
>> +	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
>>   };
>>   
>>   struct bpf_fib_lookup {
>> @@ -6999,6 +7005,9 @@ struct bpf_fib_lookup {
>>   		__u32	rt_metric;
>>   	};
>>   
>> +	/* input: source address to consider for lookup
>> +	 * output: source address result from lookup
>> +	 */
>>   	union {
>>   		__be32		ipv4_src;
>>   		__u32		ipv6_src[4];  /* in6_addr; network order */
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index a094694899c9..f3777ef1840b 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -5850,6 +5850,9 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>   	params->rt_metric = res.fi->fib_priority;
>>   	params->ifindex = dev->ifindex;
>>   
>> +	if (flags & BPF_FIB_LOOKUP_SET_SRC)
>> +		params->ipv4_src = fib_result_prefsrc(net, &res);
>
> Does it need to check 0 and return BPF_FIB_LKUP_RET_NO_SRC_ADDR for v4 also,
> or the fib_result_prefsrc does not fail?

From inspecting the other usages of the function - it does not fail.

>
>> +
>>   	/* xdp and cls_bpf programs are run in RCU-bh so
>>   	 * rcu_read_lock_bh is not needed here
>>   	 */
>> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>   	params->rt_metric = res.f6i->fib6_metric;
>>   	params->ifindex = dev->ifindex;
>>   
>> +	if (flags & BPF_FIB_LOOKUP_SET_SRC) {
>> +		if (res.f6i->fib6_prefsrc.plen) {
>> +			*(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr;
>> +		} else {
>> +			err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev,
>> +								&fl6.daddr, 0,
>> +								(struct in6_addr *)
>> +								params->ipv6_src);
>> +			if (err)
>> +				return BPF_FIB_LKUP_RET_NO_SRC_ADDR;
>
> This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of 
> improving the API. May be others have some ideas.
>
> Considering dev has no saddr is probably (?) an unlikely case, it should be ok 
> to leave it as is but at least a comment in the uapi will be needed. Otherwise, 
> the bpf prog may use the 0 dmac as-is.

I expect that a user of the helper checks that err == 0 before using any of the output params.

>
> I feel the current bpf_ipv[46]_fib_lookup helper is doing many things 
> in one 
> function and then requires different BPF_FIB_LOOKUP_* bits to select 
> what/how to 
> do. In the future, it may be worth to consider breaking it into smaller 
> kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc.
>

Yep, good idea. At least it seems that the neigh lookup could live in its own function.

>> +		}
>> +	}
>> +
>>   	if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH)
>>   		goto set_fwd_params;
>>   
>> @@ -6010,7 +6026,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>   #endif
>>   
>>   #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
>> -			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID)
>> +			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
>> +			     BPF_FIB_LOOKUP_SET_SRC)
>>   
>>   BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
>>   	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
Martin KaFai Lau Oct. 6, 2023, 6:29 a.m. UTC | #3
On 10/5/23 1:16 PM, Martynas wrote:
>>> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>>    	params->rt_metric = res.f6i->fib6_metric;
>>>    	params->ifindex = dev->ifindex;
>>>    
>>> +	if (flags & BPF_FIB_LOOKUP_SET_SRC) {
>>> +		if (res.f6i->fib6_prefsrc.plen) {
>>> +			*(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr;

A nit. just noticed. Similar to the "*dst" assignment a few lines above:

			*src = res.f6i->fib6_prefsrc.addr;

>>> +		} else {
>>> +			err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev,
>>> +								&fl6.daddr, 0,
>>> +								(struct in6_addr *)
>>> +								params->ipv6_src);

Same here. Use the "src".

>>> +			if (err)
>>> +				return BPF_FIB_LKUP_RET_NO_SRC_ADDR;
>>
>> This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of
>> improving the API. May be others have some ideas.
>>
>> Considering dev has no saddr is probably (?) an unlikely case, it should be ok
>> to leave it as is but at least a comment in the uapi will be needed. Otherwise,
>> the bpf prog may use the 0 dmac as-is.
> 
> I expect that a user of the helper checks that err == 0 before using any of the output params.

For example, the bpf prog gets BPF_FIB_LKUP_RET_NO_NEIGH and learns neigh is not 
available but ipv6_dst (and the optional ipv6_src) is still valid.

If the bpf prog gets BPF_FIB_LKUP_RET_NO_SRC_ADDR, intuitively, only ipv6_src is 
not available. The bpf prog will continue to use the ipv6_dst and dmac (which is 
actually 0).

> 
>>
>> I feel the current bpf_ipv[46]_fib_lookup helper is doing many things
>> in one
>> function and then requires different BPF_FIB_LOOKUP_* bits to select
>> what/how to
>> do. In the future, it may be worth to consider breaking it into smaller
>> kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc.
>>
> 
> Yep, good idea. At least it seems that the neigh lookup could live in its own function.

To be clear, it could be independent of this set.

Thanks.
Martynas Pumputis Oct. 6, 2023, 7:03 a.m. UTC | #4
On Fri, Oct 6, 2023, at 8:29 AM, Martin KaFai Lau wrote:
> On 10/5/23 1:16 PM, Martynas wrote:
>>>> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>>>    	params->rt_metric = res.f6i->fib6_metric;
>>>>    	params->ifindex = dev->ifindex;
>>>>    
>>>> +	if (flags & BPF_FIB_LOOKUP_SET_SRC) {
>>>> +		if (res.f6i->fib6_prefsrc.plen) {
>>>> +			*(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr;
>
> A nit. just noticed. Similar to the "*dst" assignment a few lines above:

SGTM.

>
> 			*src = res.f6i->fib6_prefsrc.addr;
>
>>>> +		} else {
>>>> +			err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev,
>>>> +								&fl6.daddr, 0,
>>>> +								(struct in6_addr *)
>>>> +								params->ipv6_src);
>
> Same here. Use the "src".
>
>>>> +			if (err)
>>>> +				return BPF_FIB_LKUP_RET_NO_SRC_ADDR;
>>>
>>> This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of
>>> improving the API. May be others have some ideas.
>>>
>>> Considering dev has no saddr is probably (?) an unlikely case, it should be ok
>>> to leave it as is but at least a comment in the uapi will be needed. Otherwise,
>>> the bpf prog may use the 0 dmac as-is.
>> 
>> I expect that a user of the helper checks that err == 0 before using any of the output params.
>
> For example, the bpf prog gets BPF_FIB_LKUP_RET_NO_NEIGH and learns 
> neigh is not 
> available but ipv6_dst (and the optional ipv6_src) is still valid.
>
> If the bpf prog gets BPF_FIB_LKUP_RET_NO_SRC_ADDR, intuitively, only 
> ipv6_src is 
> not available. The bpf prog will continue to use the ipv6_dst and dmac 
> (which is 
> actually 0).
>

Thinking out loud, we could make BPF_FIB_LOOKUP_SRC to require BPF_FIB_LOOKUP_SKIP_NEIGH, but then for some cases a user would be required to call the helper twice. This is a no-go due perf and instruction count reasons.

Nothing betters comes than explicitly documenting the behavior in the uapi comments.

>> 
>>>
>>> I feel the current bpf_ipv[46]_fib_lookup helper is doing many things
>>> in one
>>> function and then requires different BPF_FIB_LOOKUP_* bits to select
>>> what/how to
>>> do. In the future, it may be worth to consider breaking it into smaller
>>> kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc.
>>>
>> 
>> Yep, good idea. At least it seems that the neigh lookup could live in its own function.
>
> To be clear, it could be independent of this set.
>
> Thanks.
diff mbox series

Patch

diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
index c48186bf4737..21da31e1dff5 100644
--- a/include/net/ipv6_stubs.h
+++ b/include/net/ipv6_stubs.h
@@ -85,6 +85,11 @@  struct ipv6_bpf_stub {
 			       sockptr_t optval, unsigned int optlen);
 	int (*ipv6_getsockopt)(struct sock *sk, int level, int optname,
 			       sockptr_t optval, sockptr_t optlen);
+	int (*ipv6_dev_get_saddr)(struct net *net,
+				  const struct net_device *dst_dev,
+				  const struct in6_addr *daddr,
+				  unsigned int prefs,
+				  struct in6_addr *saddr);
 };
 extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0448700890f7..a6bf686eecbc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3257,6 +3257,10 @@  union bpf_attr {
  *			and *params*->smac will not be set as output. A common
  *			use case is to call **bpf_redirect_neigh**\ () after
  *			doing **bpf_fib_lookup**\ ().
+ *		**BPF_FIB_LOOKUP_SET_SRC**
+ *			Derive and set source IP addr in *params*->ipv{4,6}_src
+ *			for the nexthop. If the src addr cannot be derived,
+ *			**BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -6953,6 +6957,7 @@  enum {
 	BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),
 	BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2),
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
+	BPF_FIB_LOOKUP_SET_SRC = (1U << 4),
 };
 
 enum {
@@ -6965,6 +6970,7 @@  enum {
 	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
 };
 
 struct bpf_fib_lookup {
@@ -6999,6 +7005,9 @@  struct bpf_fib_lookup {
 		__u32	rt_metric;
 	};
 
+	/* input: source address to consider for lookup
+	 * output: source address result from lookup
+	 */
 	union {
 		__be32		ipv4_src;
 		__u32		ipv6_src[4];  /* in6_addr; network order */
diff --git a/net/core/filter.c b/net/core/filter.c
index a094694899c9..f3777ef1840b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5850,6 +5850,9 @@  static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	params->rt_metric = res.fi->fib_priority;
 	params->ifindex = dev->ifindex;
 
+	if (flags & BPF_FIB_LOOKUP_SET_SRC)
+		params->ipv4_src = fib_result_prefsrc(net, &res);
+
 	/* xdp and cls_bpf programs are run in RCU-bh so
 	 * rcu_read_lock_bh is not needed here
 	 */
@@ -5992,6 +5995,19 @@  static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	params->rt_metric = res.f6i->fib6_metric;
 	params->ifindex = dev->ifindex;
 
+	if (flags & BPF_FIB_LOOKUP_SET_SRC) {
+		if (res.f6i->fib6_prefsrc.plen) {
+			*(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr;
+		} else {
+			err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev,
+								&fl6.daddr, 0,
+								(struct in6_addr *)
+								params->ipv6_src);
+			if (err)
+				return BPF_FIB_LKUP_RET_NO_SRC_ADDR;
+		}
+	}
+
 	if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH)
 		goto set_fwd_params;
 
@@ -6010,7 +6026,8 @@  static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #endif
 
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
-			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID)
+			     BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
+			     BPF_FIB_LOOKUP_SET_SRC)
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 368824fe9719..5382c6543d46 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1060,6 +1060,7 @@  static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
 	.udp6_lib_lookup = __udp6_lib_lookup,
 	.ipv6_setsockopt = do_ipv6_setsockopt,
 	.ipv6_getsockopt = do_ipv6_getsockopt,
+	.ipv6_dev_get_saddr = ipv6_dev_get_saddr,
 };
 
 static int __init inet6_init(void)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0448700890f7..72cd0ca97689 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3257,6 +3257,10 @@  union bpf_attr {
  *			and *params*->smac will not be set as output. A common
  *			use case is to call **bpf_redirect_neigh**\ () after
  *			doing **bpf_fib_lookup**\ ().
+ *		**BPF_FIB_LOOKUP_SET_SRC**
+ *			Derive and set source IP addr in *params*->ipv{4,6}_src
+ *			for the nexthop. If the src addr cannot be derived,
+ *			**BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned.
  *
  *		*ctx* is either **struct xdp_md** for XDP programs or
  *		**struct sk_buff** tc cls_act programs.
@@ -6953,6 +6957,7 @@  enum {
 	BPF_FIB_LOOKUP_OUTPUT  = (1U << 1),
 	BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2),
 	BPF_FIB_LOOKUP_TBID    = (1U << 3),
+	BPF_FIB_LOOKUP_SET_SRC = (1U << 4),
 };
 
 enum {
@@ -6965,6 +6970,7 @@  enum {
 	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
 	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
 	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_FIB_LKUP_RET_NO_SRC_ADDR,  /* failed to derive IP src addr */
 };
 
 struct bpf_fib_lookup {
@@ -6999,6 +7005,10 @@  struct bpf_fib_lookup {
 		__u32	rt_metric;
 	};
 
+
+	/* input: source address to consider for lookup
+	 * output: source address result from lookup
+	 */
 	union {
 		__be32		ipv4_src;
 		__u32		ipv6_src[4];  /* in6_addr; network order */