Message ID | 20231003071013.824623-2-m@lambda.lt (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | bpf: Fix src IP addr related limitation in bpf_*_fib_lookup() | expand |
On 10/3/23 12:10 AM, Martynas Pumputis wrote: > Extend the bpf_fib_lookup() helper by making it to return the source > IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set. > > For example, the following snippet can be used to derive the desired > source IP address: > > struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr }; > > ret = bpf_skb_fib_lookup(skb, p, sizeof(p), > BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH); > if (ret != BPF_FIB_LKUP_RET_SUCCESS) > return TC_ACT_SHOT; > > /* the p.ipv4_src now contains the source address */ > > The inability to derive the proper source address may cause malfunctions > in BPF-based dataplanes for hosts containing netdevs with more than one > routable IP address or for multi-homed hosts. > > For example, Cilium implements packet masquerading in BPF. If an > egressing netdev to which the Cilium's BPF prog is attached has > multiple IP addresses, then only one [hardcoded] IP address can be used for > masquerading. This breaks connectivity if any other IP address should have > been selected instead, for example, when a public and private addresses > are attached to the same egress interface. > > The change was tested with Cilium [1]. > > Nikolay Aleksandrov helped to figure out the IPv6 addr selection. > > [1]: https://github.com/cilium/cilium/pull/28283 > > Signed-off-by: Martynas Pumputis <m@lambda.lt> > --- > include/net/ipv6_stubs.h | 5 +++++ > include/uapi/linux/bpf.h | 9 +++++++++ > net/core/filter.c | 19 ++++++++++++++++++- > net/ipv6/af_inet6.c | 1 + > tools/include/uapi/linux/bpf.h | 10 ++++++++++ > 5 files changed, 43 insertions(+), 1 deletion(-) > > diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h > index c48186bf4737..21da31e1dff5 100644 > --- a/include/net/ipv6_stubs.h > +++ b/include/net/ipv6_stubs.h > @@ -85,6 +85,11 @@ struct ipv6_bpf_stub { > sockptr_t optval, unsigned int optlen); > int (*ipv6_getsockopt)(struct sock *sk, int level, int optname, > sockptr_t optval, sockptr_t optlen); > + int (*ipv6_dev_get_saddr)(struct net *net, > + const struct net_device *dst_dev, > + const struct in6_addr *daddr, > + unsigned int prefs, > + struct in6_addr *saddr); > }; > extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 0448700890f7..a6bf686eecbc 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3257,6 +3257,10 @@ union bpf_attr { > * and *params*->smac will not be set as output. A common > * use case is to call **bpf_redirect_neigh**\ () after > * doing **bpf_fib_lookup**\ (). > + * **BPF_FIB_LOOKUP_SET_SRC** > + * Derive and set source IP addr in *params*->ipv{4,6}_src > + * for the nexthop. If the src addr cannot be derived, > + * **BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned. > * > * *ctx* is either **struct xdp_md** for XDP programs or > * **struct sk_buff** tc cls_act programs. > @@ -6953,6 +6957,7 @@ enum { > BPF_FIB_LOOKUP_OUTPUT = (1U << 1), > BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2), > BPF_FIB_LOOKUP_TBID = (1U << 3), > + BPF_FIB_LOOKUP_SET_SRC = (1U << 4), bikeshedding: Shorten it to BPF_FIB_LOOKUP_SRC? > }; > > enum { > @@ -6965,6 +6970,7 @@ enum { > BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */ > BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */ > BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */ > + BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */ > }; > > struct bpf_fib_lookup { > @@ -6999,6 +7005,9 @@ struct bpf_fib_lookup { > __u32 rt_metric; > }; > > + /* input: source address to consider for lookup > + * output: source address result from lookup > + */ > union { > __be32 ipv4_src; > __u32 ipv6_src[4]; /* in6_addr; network order */ > diff --git a/net/core/filter.c b/net/core/filter.c > index a094694899c9..f3777ef1840b 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -5850,6 +5850,9 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, > params->rt_metric = res.fi->fib_priority; > params->ifindex = dev->ifindex; > > + if (flags & BPF_FIB_LOOKUP_SET_SRC) > + params->ipv4_src = fib_result_prefsrc(net, &res); Does it need to check 0 and return BPF_FIB_LKUP_RET_NO_SRC_ADDR for v4 also, or the fib_result_prefsrc does not fail? > + > /* xdp and cls_bpf programs are run in RCU-bh so > * rcu_read_lock_bh is not needed here > */ > @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, > params->rt_metric = res.f6i->fib6_metric; > params->ifindex = dev->ifindex; > > + if (flags & BPF_FIB_LOOKUP_SET_SRC) { > + if (res.f6i->fib6_prefsrc.plen) { > + *(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr; > + } else { > + err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev, > + &fl6.daddr, 0, > + (struct in6_addr *) > + params->ipv6_src); > + if (err) > + return BPF_FIB_LKUP_RET_NO_SRC_ADDR; This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of improving the API. May be others have some ideas. Considering dev has no saddr is probably (?) an unlikely case, it should be ok to leave it as is but at least a comment in the uapi will be needed. Otherwise, the bpf prog may use the 0 dmac as-is. I feel the current bpf_ipv[46]_fib_lookup helper is doing many things in one function and then requires different BPF_FIB_LOOKUP_* bits to select what/how to do. In the future, it may be worth to consider breaking it into smaller kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc. > + } > + } > + > if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH) > goto set_fwd_params; > > @@ -6010,7 +6026,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, > #endif > > #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \ > - BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID) > + BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \ > + BPF_FIB_LOOKUP_SET_SRC) > > BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx, > struct bpf_fib_lookup *, params, int, plen, u32, flags)
Thanks for the review. On Wed, Oct 4, 2023, at 10:09 PM, Martin KaFai Lau wrote: > On 10/3/23 12:10 AM, Martynas Pumputis wrote: >> Extend the bpf_fib_lookup() helper by making it to return the source >> IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set. >> >> For example, the following snippet can be used to derive the desired >> source IP address: >> >> struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr }; >> >> ret = bpf_skb_fib_lookup(skb, p, sizeof(p), >> BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH); >> if (ret != BPF_FIB_LKUP_RET_SUCCESS) >> return TC_ACT_SHOT; >> >> /* the p.ipv4_src now contains the source address */ >> >> The inability to derive the proper source address may cause malfunctions >> in BPF-based dataplanes for hosts containing netdevs with more than one >> routable IP address or for multi-homed hosts. >> >> For example, Cilium implements packet masquerading in BPF. If an >> egressing netdev to which the Cilium's BPF prog is attached has >> multiple IP addresses, then only one [hardcoded] IP address can be used for >> masquerading. This breaks connectivity if any other IP address should have >> been selected instead, for example, when a public and private addresses >> are attached to the same egress interface. >> >> The change was tested with Cilium [1]. >> >> Nikolay Aleksandrov helped to figure out the IPv6 addr selection. >> >> [1]: https://github.com/cilium/cilium/pull/28283 >> >> Signed-off-by: Martynas Pumputis <m@lambda.lt> >> --- >> include/net/ipv6_stubs.h | 5 +++++ >> include/uapi/linux/bpf.h | 9 +++++++++ >> net/core/filter.c | 19 ++++++++++++++++++- >> net/ipv6/af_inet6.c | 1 + >> tools/include/uapi/linux/bpf.h | 10 ++++++++++ >> 5 files changed, 43 insertions(+), 1 deletion(-) >> >> diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h >> index c48186bf4737..21da31e1dff5 100644 >> --- a/include/net/ipv6_stubs.h >> +++ b/include/net/ipv6_stubs.h >> @@ -85,6 +85,11 @@ struct ipv6_bpf_stub { >> sockptr_t optval, unsigned int optlen); >> int (*ipv6_getsockopt)(struct sock *sk, int level, int optname, >> sockptr_t optval, sockptr_t optlen); >> + int (*ipv6_dev_get_saddr)(struct net *net, >> + const struct net_device *dst_dev, >> + const struct in6_addr *daddr, >> + unsigned int prefs, >> + struct in6_addr *saddr); >> }; >> extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 0448700890f7..a6bf686eecbc 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -3257,6 +3257,10 @@ union bpf_attr { >> * and *params*->smac will not be set as output. A common >> * use case is to call **bpf_redirect_neigh**\ () after >> * doing **bpf_fib_lookup**\ (). >> + * **BPF_FIB_LOOKUP_SET_SRC** >> + * Derive and set source IP addr in *params*->ipv{4,6}_src >> + * for the nexthop. If the src addr cannot be derived, >> + * **BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned. >> * >> * *ctx* is either **struct xdp_md** for XDP programs or >> * **struct sk_buff** tc cls_act programs. >> @@ -6953,6 +6957,7 @@ enum { >> BPF_FIB_LOOKUP_OUTPUT = (1U << 1), >> BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2), >> BPF_FIB_LOOKUP_TBID = (1U << 3), >> + BPF_FIB_LOOKUP_SET_SRC = (1U << 4), > > bikeshedding: Shorten it to BPF_FIB_LOOKUP_SRC? SGTM. > >> }; >> >> enum { >> @@ -6965,6 +6970,7 @@ enum { >> BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */ >> BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */ >> BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */ >> + BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */ >> }; >> >> struct bpf_fib_lookup { >> @@ -6999,6 +7005,9 @@ struct bpf_fib_lookup { >> __u32 rt_metric; >> }; >> >> + /* input: source address to consider for lookup >> + * output: source address result from lookup >> + */ >> union { >> __be32 ipv4_src; >> __u32 ipv6_src[4]; /* in6_addr; network order */ >> diff --git a/net/core/filter.c b/net/core/filter.c >> index a094694899c9..f3777ef1840b 100644 >> --- a/net/core/filter.c >> +++ b/net/core/filter.c >> @@ -5850,6 +5850,9 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, >> params->rt_metric = res.fi->fib_priority; >> params->ifindex = dev->ifindex; >> >> + if (flags & BPF_FIB_LOOKUP_SET_SRC) >> + params->ipv4_src = fib_result_prefsrc(net, &res); > > Does it need to check 0 and return BPF_FIB_LKUP_RET_NO_SRC_ADDR for v4 also, > or the fib_result_prefsrc does not fail? From inspecting the other usages of the function - it does not fail. > >> + >> /* xdp and cls_bpf programs are run in RCU-bh so >> * rcu_read_lock_bh is not needed here >> */ >> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, >> params->rt_metric = res.f6i->fib6_metric; >> params->ifindex = dev->ifindex; >> >> + if (flags & BPF_FIB_LOOKUP_SET_SRC) { >> + if (res.f6i->fib6_prefsrc.plen) { >> + *(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr; >> + } else { >> + err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev, >> + &fl6.daddr, 0, >> + (struct in6_addr *) >> + params->ipv6_src); >> + if (err) >> + return BPF_FIB_LKUP_RET_NO_SRC_ADDR; > > This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of > improving the API. May be others have some ideas. > > Considering dev has no saddr is probably (?) an unlikely case, it should be ok > to leave it as is but at least a comment in the uapi will be needed. Otherwise, > the bpf prog may use the 0 dmac as-is. I expect that a user of the helper checks that err == 0 before using any of the output params. > > I feel the current bpf_ipv[46]_fib_lookup helper is doing many things > in one > function and then requires different BPF_FIB_LOOKUP_* bits to select > what/how to > do. In the future, it may be worth to consider breaking it into smaller > kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc. > Yep, good idea. At least it seems that the neigh lookup could live in its own function. >> + } >> + } >> + >> if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH) >> goto set_fwd_params; >> >> @@ -6010,7 +6026,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, >> #endif >> >> #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \ >> - BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID) >> + BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \ >> + BPF_FIB_LOOKUP_SET_SRC) >> >> BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx, >> struct bpf_fib_lookup *, params, int, plen, u32, flags)
On 10/5/23 1:16 PM, Martynas wrote: >>> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, >>> params->rt_metric = res.f6i->fib6_metric; >>> params->ifindex = dev->ifindex; >>> >>> + if (flags & BPF_FIB_LOOKUP_SET_SRC) { >>> + if (res.f6i->fib6_prefsrc.plen) { >>> + *(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr; A nit. just noticed. Similar to the "*dst" assignment a few lines above: *src = res.f6i->fib6_prefsrc.addr; >>> + } else { >>> + err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev, >>> + &fl6.daddr, 0, >>> + (struct in6_addr *) >>> + params->ipv6_src); Same here. Use the "src". >>> + if (err) >>> + return BPF_FIB_LKUP_RET_NO_SRC_ADDR; >> >> This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of >> improving the API. May be others have some ideas. >> >> Considering dev has no saddr is probably (?) an unlikely case, it should be ok >> to leave it as is but at least a comment in the uapi will be needed. Otherwise, >> the bpf prog may use the 0 dmac as-is. > > I expect that a user of the helper checks that err == 0 before using any of the output params. For example, the bpf prog gets BPF_FIB_LKUP_RET_NO_NEIGH and learns neigh is not available but ipv6_dst (and the optional ipv6_src) is still valid. If the bpf prog gets BPF_FIB_LKUP_RET_NO_SRC_ADDR, intuitively, only ipv6_src is not available. The bpf prog will continue to use the ipv6_dst and dmac (which is actually 0). > >> >> I feel the current bpf_ipv[46]_fib_lookup helper is doing many things >> in one >> function and then requires different BPF_FIB_LOOKUP_* bits to select >> what/how to >> do. In the future, it may be worth to consider breaking it into smaller >> kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc. >> > > Yep, good idea. At least it seems that the neigh lookup could live in its own function. To be clear, it could be independent of this set. Thanks.
On Fri, Oct 6, 2023, at 8:29 AM, Martin KaFai Lau wrote: > On 10/5/23 1:16 PM, Martynas wrote: >>>> @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, >>>> params->rt_metric = res.f6i->fib6_metric; >>>> params->ifindex = dev->ifindex; >>>> >>>> + if (flags & BPF_FIB_LOOKUP_SET_SRC) { >>>> + if (res.f6i->fib6_prefsrc.plen) { >>>> + *(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr; > > A nit. just noticed. Similar to the "*dst" assignment a few lines above: SGTM. > > *src = res.f6i->fib6_prefsrc.addr; > >>>> + } else { >>>> + err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev, >>>> + &fl6.daddr, 0, >>>> + (struct in6_addr *) >>>> + params->ipv6_src); > > Same here. Use the "src". > >>>> + if (err) >>>> + return BPF_FIB_LKUP_RET_NO_SRC_ADDR; >>> >>> This error also implies BPF_FIB_LKUP_RET_NO_NEIGH. I don't have a clean way of >>> improving the API. May be others have some ideas. >>> >>> Considering dev has no saddr is probably (?) an unlikely case, it should be ok >>> to leave it as is but at least a comment in the uapi will be needed. Otherwise, >>> the bpf prog may use the 0 dmac as-is. >> >> I expect that a user of the helper checks that err == 0 before using any of the output params. > > For example, the bpf prog gets BPF_FIB_LKUP_RET_NO_NEIGH and learns > neigh is not > available but ipv6_dst (and the optional ipv6_src) is still valid. > > If the bpf prog gets BPF_FIB_LKUP_RET_NO_SRC_ADDR, intuitively, only > ipv6_src is > not available. The bpf prog will continue to use the ipv6_dst and dmac > (which is > actually 0). > Thinking out loud, we could make BPF_FIB_LOOKUP_SRC to require BPF_FIB_LOOKUP_SKIP_NEIGH, but then for some cases a user would be required to call the helper twice. This is a no-go due perf and instruction count reasons. Nothing betters comes than explicitly documenting the behavior in the uapi comments. >> >>> >>> I feel the current bpf_ipv[46]_fib_lookup helper is doing many things >>> in one >>> function and then requires different BPF_FIB_LOOKUP_* bits to select >>> what/how to >>> do. In the future, it may be worth to consider breaking it into smaller >>> kfunc(s). e.g. the __ipv[46]_neigh_lookup could be in its own kfunc. >>> >> >> Yep, good idea. At least it seems that the neigh lookup could live in its own function. > > To be clear, it could be independent of this set. > > Thanks.
diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h index c48186bf4737..21da31e1dff5 100644 --- a/include/net/ipv6_stubs.h +++ b/include/net/ipv6_stubs.h @@ -85,6 +85,11 @@ struct ipv6_bpf_stub { sockptr_t optval, unsigned int optlen); int (*ipv6_getsockopt)(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen); + int (*ipv6_dev_get_saddr)(struct net *net, + const struct net_device *dst_dev, + const struct in6_addr *daddr, + unsigned int prefs, + struct in6_addr *saddr); }; extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0448700890f7..a6bf686eecbc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3257,6 +3257,10 @@ union bpf_attr { * and *params*->smac will not be set as output. A common * use case is to call **bpf_redirect_neigh**\ () after * doing **bpf_fib_lookup**\ (). + * **BPF_FIB_LOOKUP_SET_SRC** + * Derive and set source IP addr in *params*->ipv{4,6}_src + * for the nexthop. If the src addr cannot be derived, + * **BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned. * * *ctx* is either **struct xdp_md** for XDP programs or * **struct sk_buff** tc cls_act programs. @@ -6953,6 +6957,7 @@ enum { BPF_FIB_LOOKUP_OUTPUT = (1U << 1), BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2), BPF_FIB_LOOKUP_TBID = (1U << 3), + BPF_FIB_LOOKUP_SET_SRC = (1U << 4), }; enum { @@ -6965,6 +6970,7 @@ enum { BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */ BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */ BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */ + BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */ }; struct bpf_fib_lookup { @@ -6999,6 +7005,9 @@ struct bpf_fib_lookup { __u32 rt_metric; }; + /* input: source address to consider for lookup + * output: source address result from lookup + */ union { __be32 ipv4_src; __u32 ipv6_src[4]; /* in6_addr; network order */ diff --git a/net/core/filter.c b/net/core/filter.c index a094694899c9..f3777ef1840b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5850,6 +5850,9 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, params->rt_metric = res.fi->fib_priority; params->ifindex = dev->ifindex; + if (flags & BPF_FIB_LOOKUP_SET_SRC) + params->ipv4_src = fib_result_prefsrc(net, &res); + /* xdp and cls_bpf programs are run in RCU-bh so * rcu_read_lock_bh is not needed here */ @@ -5992,6 +5995,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, params->rt_metric = res.f6i->fib6_metric; params->ifindex = dev->ifindex; + if (flags & BPF_FIB_LOOKUP_SET_SRC) { + if (res.f6i->fib6_prefsrc.plen) { + *(struct in6_addr *)params->ipv6_src = res.f6i->fib6_prefsrc.addr; + } else { + err = ipv6_bpf_stub->ipv6_dev_get_saddr(net, dev, + &fl6.daddr, 0, + (struct in6_addr *) + params->ipv6_src); + if (err) + return BPF_FIB_LKUP_RET_NO_SRC_ADDR; + } + } + if (flags & BPF_FIB_LOOKUP_SKIP_NEIGH) goto set_fwd_params; @@ -6010,7 +6026,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, #endif #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \ - BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID) + BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \ + BPF_FIB_LOOKUP_SET_SRC) BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx, struct bpf_fib_lookup *, params, int, plen, u32, flags) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 368824fe9719..5382c6543d46 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -1060,6 +1060,7 @@ static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = { .udp6_lib_lookup = __udp6_lib_lookup, .ipv6_setsockopt = do_ipv6_setsockopt, .ipv6_getsockopt = do_ipv6_getsockopt, + .ipv6_dev_get_saddr = ipv6_dev_get_saddr, }; static int __init inet6_init(void) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0448700890f7..72cd0ca97689 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3257,6 +3257,10 @@ union bpf_attr { * and *params*->smac will not be set as output. A common * use case is to call **bpf_redirect_neigh**\ () after * doing **bpf_fib_lookup**\ (). + * **BPF_FIB_LOOKUP_SET_SRC** + * Derive and set source IP addr in *params*->ipv{4,6}_src + * for the nexthop. If the src addr cannot be derived, + * **BPF_FIB_LKUP_RET_NO_SRC_ADDR** is returned. * * *ctx* is either **struct xdp_md** for XDP programs or * **struct sk_buff** tc cls_act programs. @@ -6953,6 +6957,7 @@ enum { BPF_FIB_LOOKUP_OUTPUT = (1U << 1), BPF_FIB_LOOKUP_SKIP_NEIGH = (1U << 2), BPF_FIB_LOOKUP_TBID = (1U << 3), + BPF_FIB_LOOKUP_SET_SRC = (1U << 4), }; enum { @@ -6965,6 +6970,7 @@ enum { BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */ BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */ BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */ + BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */ }; struct bpf_fib_lookup { @@ -6999,6 +7005,10 @@ struct bpf_fib_lookup { __u32 rt_metric; }; + + /* input: source address to consider for lookup + * output: source address result from lookup + */ union { __be32 ipv4_src; __u32 ipv6_src[4]; /* in6_addr; network order */
Extend the bpf_fib_lookup() helper by making it to return the source IPv4/IPv6 address if the BPF_FIB_LOOKUP_SET_SRC flag is set. For example, the following snippet can be used to derive the desired source IP address: struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr }; ret = bpf_skb_fib_lookup(skb, p, sizeof(p), BPF_FIB_LOOKUP_SET_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH); if (ret != BPF_FIB_LKUP_RET_SUCCESS) return TC_ACT_SHOT; /* the p.ipv4_src now contains the source address */ The inability to derive the proper source address may cause malfunctions in BPF-based dataplanes for hosts containing netdevs with more than one routable IP address or for multi-homed hosts. For example, Cilium implements packet masquerading in BPF. If an egressing netdev to which the Cilium's BPF prog is attached has multiple IP addresses, then only one [hardcoded] IP address can be used for masquerading. This breaks connectivity if any other IP address should have been selected instead, for example, when a public and private addresses are attached to the same egress interface. The change was tested with Cilium [1]. Nikolay Aleksandrov helped to figure out the IPv6 addr selection. [1]: https://github.com/cilium/cilium/pull/28283 Signed-off-by: Martynas Pumputis <m@lambda.lt> --- include/net/ipv6_stubs.h | 5 +++++ include/uapi/linux/bpf.h | 9 +++++++++ net/core/filter.c | 19 ++++++++++++++++++- net/ipv6/af_inet6.c | 1 + tools/include/uapi/linux/bpf.h | 10 ++++++++++ 5 files changed, 43 insertions(+), 1 deletion(-)