Message ID | cffd245430d10fa2a14c32d1c768eef7cfeb8963.1646068241.git.gnault@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net] ipv4: fix route lookups when handling ICMP redirects and PMTU updates |
On 2/28/22 10:16 AM, Guillaume Nault wrote:
> Fixes: d3a25c980fc2 ("ipv4: Fix nexthop exception hash computation.")

That does not seem related to tos in the flow struct at all.

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f33ad1f383b6..d5d058de3664 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs)
>  }
>  EXPORT_SYMBOL(__ip_select_ident);
>
> +static void ip_rt_fix_tos(struct flowi4 *fl4)

make this a static inline in include/net/flow.h and update
flowi4_init_output and flowi4_update_output to use it. That should cover
a few of the cases below leaving just ...

> +{
> +	__u8 tos = RT_FL_TOS(fl4);
> +
> +	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
> +	fl4->flowi4_scope = tos & RTO_ONLINK ?
> +			    RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
> +}
> +
>  static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
>  			     const struct sock *sk,
>  			     const struct iphdr *iph,
> @@ -824,6 +833,7 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf
>  	rt = (struct rtable *) dst;
>
>  	__build_flow_key(net, &fl4, sk, iph, oif, tos, prot, mark, 0);
> +	ip_rt_fix_tos(&fl4);
>  	__ip_do_redirect(rt, skb, &fl4, true);
>  }
>
> @@ -1048,6 +1058,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>  	struct flowi4 fl4;
>
>  	ip_rt_build_flow_key(&fl4, sk, skb);
> +	ip_rt_fix_tos(&fl4);
>
>  	/* Don't make lookup fail for bridged encapsulations */
>  	if (skb && netif_is_any_bridge_port(skb->dev))
> @@ -1122,6 +1133,8 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
>  			goto out;
>
>  		new = true;
> +	} else {
> +		ip_rt_fix_tos(&fl4);
>  	}
>
>  	__ip_rt_update_pmtu((struct rtable *)xfrm_dst_path(&rt->dst), &fl4, mtu);
> @@ -2603,7 +2616,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>  struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>  					const struct sk_buff *skb)
>  {
> -	__u8 tos = RT_FL_TOS(fl4);
>  	struct fib_result res = {
>  		.type = RTN_UNSPEC,
>  		.fi = NULL,
> @@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>  	struct rtable *rth;
>
>  	fl4->flowi4_iif = LOOPBACK_IFINDEX;
> -	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
> -	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
> -			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
> +	ip_rt_fix_tos(fl4);

... this one to call the new helper.

>
>  	rcu_read_lock();
>  	rth = ip_route_output_key_hash_rcu(net, fl4, &res, skb);
On 2/28/22 10:16 AM, Guillaume Nault wrote:
> ---
>  net/ipv4/route.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>

also, add test cases to the pmtu script.
On Mon, Feb 28, 2022 at 10:31:58AM -0700, David Ahern wrote:
> On 2/28/22 10:16 AM, Guillaume Nault wrote:
> > Fixes: d3a25c980fc2 ("ipv4: Fix nexthop exception hash computation.")
>
> That does not seem related to tos in the flow struct at all.

Ouch, copy/paste mistake.
I meant 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), which is
the next commit with 'git log -- net/ipv4/route.c'.
Really sorry :/, and thanks a lot for catching that!

> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index f33ad1f383b6..d5d058de3664 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs)
> >  }
> >  EXPORT_SYMBOL(__ip_select_ident);
> >
> > +static void ip_rt_fix_tos(struct flowi4 *fl4)
>
> make this a static inline in include/net/flow.h and update
> flowi4_init_output and flowi4_update_output to use it. That should cover
> a few of the cases below leaving just ...

Hum, I didn't think about this option, but it looks risky to me. As I
put it in note 1, ip_route_output_key_hash() unconditionally sets
->flowi4_scope, assuming it can infer the scope from the RTO_ONLINK bit
of ->flowi4_tos. If we sanitise these fields in flowi4_init_output()
(and flowi4_update_output()), then ip_route_output_key_hash() would
sometimes work on already sanitised values and sometimes not. So it
wouldn't know if it should initialise ->flowi4_scope.

We could decide to let ip_route_output_key_hash() initialise
->flowi4_scope only when the RTO_ONLINK bit is set, which
guarantees that we don't have sanitised values. But before that, we'd
need to audit all other callers, to verify that they correctly
initialise the ->flowi4_scope with RT_SCOPE_UNIVERSE, since
ip_route_output_key_hash() isn't going to do it for them anymore.
I'll audit all these callers, but that should be something for
net-next.

> > @@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
> >  	struct rtable *rth;
> >
> >  	fl4->flowi4_iif = LOOPBACK_IFINDEX;
> > -	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
> > -	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
> > -			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
> > +	ip_rt_fix_tos(fl4);
>
> ... this one to call the new helper.

BTW, here's a bit more about the context around this patch.
I found the problem while working on removing the use of RTO_ONLINK, so
that ->flowi4_tos could be converted to dscp_t.

The objective is to modify callers so that they set ->flowi4_scope
directly, instead of using RTO_ONLINK to mark their intention (and that's
why I said I'd have to audit them anyway).

Once that is done, ip_rt_fix_tos() won't have to touch the scope
anymore. And once ->flowi4_tos is converted to dscp_t, we'll be able to
remove that function entirely, since dscp_t ensures ECN bits are cleared
(IPTOS_RT_MASK also ensures that high order bits are cleared, but
that's redundant with the RT_TOS() calls already done by callers, which
aren't really desirable anyway).

> >
> >  	rcu_read_lock();
> >  	rth = ip_route_output_key_hash_rcu(net, fl4, &res, skb);
>
On 2/28/22 1:54 PM, Guillaume Nault wrote:
> On Mon, Feb 28, 2022 at 10:31:58AM -0700, David Ahern wrote:
>> On 2/28/22 10:16 AM, Guillaume Nault wrote:
>>> Fixes: d3a25c980fc2 ("ipv4: Fix nexthop exception hash computation.")
>>
>> That does not seem related to tos in the flow struct at all.
>
> Ouch, copy/paste mistake.
> I meant 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), which is
> the next commit with 'git log -- net/ipv4/route.c'.
> Really sorry :/, and thanks a lot for catching that!
>
>>> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>>> index f33ad1f383b6..d5d058de3664 100644
>>> --- a/net/ipv4/route.c
>>> +++ b/net/ipv4/route.c
>>> @@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs)
>>>  }
>>>  EXPORT_SYMBOL(__ip_select_ident);
>>>
>>> +static void ip_rt_fix_tos(struct flowi4 *fl4)
>>
>> make this a static inline in include/net/flow.h and update
>> flowi4_init_output and flowi4_update_output to use it. That should cover
>> a few of the cases below leaving just ...
>
> Hum, I didn't think about this option, but it looks risky to me. As I
> put it in note 1, ip_route_output_key_hash() unconditionally sets
> ->flowi4_scope, assuming it can infer the scope from the RTO_ONLINK bit
> of ->flowi4_tos. If we sanitise these fields in flowi4_init_output()
> (and flowi4_update_output()), then ip_route_output_key_hash() would
> sometimes work on already sanitised values and sometimes not. So it
> wouldn't know if it should initialise ->flowi4_scope.
>
> We could decide to let ip_route_output_key_hash() initialise
> ->flowi4_scope only when the RTO_ONLINK bit is set, which
> guarantees that we don't have sanitised values. But before that, we'd
> need to audit all other callers, to verify that they correctly
> initialise the ->flowi4_scope with RT_SCOPE_UNIVERSE, since
> ip_route_output_key_hash() isn't going to do it for them anymore.
> I'll audit all these callers, but that should be something for
> net-next.

I'm not following the response. You are moving the tos logic from
ip_route_output_key_hash to a helper and calling the new helper for
other fib lookups. My suggestion was to correctly set / fixup the tos
and scope when flowi4 is initialized (reducing the number of places the
fixup is needed) and recognizing below that ip_route_output_key_hash
still needs the call to the new ip_rt_fix_tos.

>
>>> @@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>>>  	struct rtable *rth;
>>>
>>>  	fl4->flowi4_iif = LOOPBACK_IFINDEX;
>>> -	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
>>> -	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
>>> -			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
>>> +	ip_rt_fix_tos(fl4);
>>
>> ... this one to call the new helper.
>
> BTW, here's a bit more about the context around this patch.
> I found the problem while working on removing the use of RTO_ONLINK, so
> that ->flowi4_tos could be converted to dscp_t.
>
> The objective is to modify callers so that they set ->flowi4_scope
> directly, instead of using RTO_ONLINK to mark their intention (and that's
> why I said I'd have to audit them anyway).
>
> Once that is done, ip_rt_fix_tos() won't have to touch the scope
> anymore. And once ->flowi4_tos is converted to dscp_t, we'll be able to
> remove that function entirely, since dscp_t ensures ECN bits are cleared
> (IPTOS_RT_MASK also ensures that high order bits are cleared, but
> that's redundant with the RT_TOS() calls already done by callers, which
> aren't really desirable anyway).
>
On Mon, Feb 28, 2022 at 09:31:09PM -0700, David Ahern wrote:
> On 2/28/22 1:54 PM, Guillaume Nault wrote:
> > On Mon, Feb 28, 2022 at 10:31:58AM -0700, David Ahern wrote:
> >> On 2/28/22 10:16 AM, Guillaume Nault wrote:
> >>> Fixes: d3a25c980fc2 ("ipv4: Fix nexthop exception hash computation.")
> >>
> >> That does not seem related to tos in the flow struct at all.
> >
> > Ouch, copy/paste mistake.
> > I meant 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), which is
> > the next commit with 'git log -- net/ipv4/route.c'.
> > Really sorry :/, and thanks a lot for catching that!
> >
> >>> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >>> index f33ad1f383b6..d5d058de3664 100644
> >>> --- a/net/ipv4/route.c
> >>> +++ b/net/ipv4/route.c
> >>> @@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs)
> >>>  }
> >>>  EXPORT_SYMBOL(__ip_select_ident);
> >>>
> >>> +static void ip_rt_fix_tos(struct flowi4 *fl4)
> >>
> >> make this a static inline in include/net/flow.h and update
> >> flowi4_init_output and flowi4_update_output to use it. That should cover
> >> a few of the cases below leaving just ...
> >
> > Hum, I didn't think about this option, but it looks risky to me. As I
> > put it in note 1, ip_route_output_key_hash() unconditionally sets
> > ->flowi4_scope, assuming it can infer the scope from the RTO_ONLINK bit
> > of ->flowi4_tos. If we sanitise these fields in flowi4_init_output()
> > (and flowi4_update_output()), then ip_route_output_key_hash() would
> > sometimes work on already sanitised values and sometimes not. So it
> > wouldn't know if it should initialise ->flowi4_scope.
> >
> > We could decide to let ip_route_output_key_hash() initialise
> > ->flowi4_scope only when the RTO_ONLINK bit is set, which
> > guarantees that we don't have sanitised values. But before that, we'd
> > need to audit all other callers, to verify that they correctly
> > initialise the ->flowi4_scope with RT_SCOPE_UNIVERSE, since
> > ip_route_output_key_hash() isn't going to do it for them anymore.
> > I'll audit all these callers, but that should be something for
> > net-next.
>
> I'm not following the response. You are moving the tos logic from
> ip_route_output_key_hash to a helper and calling the new helper for
> other fib lookups. My suggestion was to correctly set / fixup the tos
> and scope when flowi4 is initialized (reducing the number of places the
> fixup is needed) and recognizing below that ip_route_output_key_hash
> still needs the call to the new ip_rt_fix_tos.

The problem is that we can't sanitise fl4 twice:

    fl4->flowi4_tos = 0x04 | RTO_ONLINK;
    fl4->flowi4_scope = whatever;
    ip_rt_fix_tos(fl4);
    /* Now ->flowi4_tos == 0x04 and ->flowi4_scope == RT_SCOPE_LINK */
    ip_rt_fix_tos(fl4);
    /* Now ->flowi4_scope is wrongly changed to RT_SCOPE_UNIVERSE */

Therefore we can't call the helper in ip_route_output_key_hash() "just
in case", because that has to be done exactly once and we can't know
whether fl4 has already been sanitised or not.

The second part of my reply was about trying to allow double calls to
ip_rt_fix_tos() (as it's required for the solution you proposed).
It looks like all call paths initialise ->flowi4_scope to zero (that is,
RT_SCOPE_UNIVERSE). If that's really the case, then ip_rt_fix_tos()
could reset ->flowi4_scope only when RTO_ONLINK is on. Then we wouldn't
have to worry about the problem described above. But that requires
auditing all code paths to ensure that they all properly initialise the
scope to RT_SCOPE_UNIVERSE, otherwise we risk introducing regressions
because of uninitialised ->flowi4_scope. So this kind of work seems
better suited for the net-next tree.

And my final point was that the need for ip_rt_fix_tos() is temporary:
I plan to do the call paths review anyway, to make them initialise tos
and scope properly, thus removing the need for RTO_ONLINK. I already
have a draft patch series, but as I said that's work for net-next.

> >
> >>> @@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
> >>>  	struct rtable *rth;
> >>>
> >>>  	fl4->flowi4_iif = LOOPBACK_IFINDEX;
> >>> -	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
> >>> -	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
> >>> -			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
> >>> +	ip_rt_fix_tos(fl4);
> >>
> >> ... this one to call the new helper.
> >
> > BTW, here's a bit more about the context around this patch.
> > I found the problem while working on removing the use of RTO_ONLINK, so
> > that ->flowi4_tos could be converted to dscp_t.
> >
> > The objective is to modify callers so that they set ->flowi4_scope
> > directly, instead of using RTO_ONLINK to mark their intention (and that's
> > why I said I'd have to audit them anyway).
> >
> > Once that is done, ip_rt_fix_tos() won't have to touch the scope
> > anymore. And once ->flowi4_tos is converted to dscp_t, we'll be able to
> > remove that function entirely, since dscp_t ensures ECN bits are cleared
> > (IPTOS_RT_MASK also ensures that high order bits are cleared, but
> > that's redundant with the RT_TOS() calls already done by callers, which
> > aren't really desirable anyway).
> >
>
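The double-sanitisation hazard described in the message above can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: `struct flowi4_model`, `fix_tos()`, `fix_tos_idempotent()` and `scope_after()` are hypothetical names introduced for illustration, and the constant values are assumed to mirror the kernel headers. `fix_tos()` copies the logic of ip_rt_fix_tos() from the patch; `fix_tos_idempotent()` models the alternative discussed in the thread, where the scope is only touched when RTO_ONLINK is set.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; values assumed to match the kernel headers. */
#define IPTOS_RT_MASK     0x1C /* TOS bits used for routing, ECN bits excluded */
#define RTO_ONLINK        0x01 /* flag carried in the low bit of flowi4_tos */
#define RT_SCOPE_UNIVERSE 0
#define RT_SCOPE_LINK     253

/* Hypothetical model of struct flowi4, reduced to the two fields at stake. */
struct flowi4_model {
	uint8_t flowi4_tos;
	uint8_t flowi4_scope;
};

#define RT_FL_TOS(fl4) ((fl4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))

/* Same logic as ip_rt_fix_tos() in the patch: always rewrites the scope. */
static void fix_tos(struct flowi4_model *fl4)
{
	uint8_t tos = RT_FL_TOS(fl4);

	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
	fl4->flowi4_scope = tos & RTO_ONLINK ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
}

/* Variant from the discussion: only set the scope when RTO_ONLINK is seen.
 * Safe to call twice, but relies on callers having initialised the scope. */
static void fix_tos_idempotent(struct flowi4_model *fl4)
{
	uint8_t tos = RT_FL_TOS(fl4);

	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
	if (tos & RTO_ONLINK)
		fl4->flowi4_scope = RT_SCOPE_LINK;
}

/* Apply a sanitiser `calls` times and report the resulting scope. */
static uint8_t scope_after(void (*fix)(struct flowi4_model *),
			   uint8_t tos, uint8_t scope, int calls)
{
	struct flowi4_model fl4 = { .flowi4_tos = tos, .flowi4_scope = scope };

	for (int i = 0; i < calls; i++)
		fix(&fl4);
	return fl4.flowi4_scope;
}
```

Calling `scope_after(fix_tos, 0x04 | RTO_ONLINK, RT_SCOPE_UNIVERSE, 2)` loses the link scope on the second pass, while the idempotent variant keeps it. The price of the variant is visible with `scope_after(fix_tos_idempotent, 0x04, 42, 1)`: a caller that left garbage in the scope field is no longer corrected, which is why the audit of all callers is needed first.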
On 3/1/22 4:41 AM, Guillaume Nault wrote:
> And my final point was that the need for ip_rt_fix_tos() is temporary:
> I plan to do the call paths review anyway, to make them initialise tos
> and scope properly, thus removing the need for RTO_ONLINK. I already
> have a draft patch series, but as I said that's work for net-next.

ok, if the cleanup is going to happen in -next then this simple patch
seems good for net and stable
On Wed, Mar 02, 2022 at 09:19:43AM -0700, David Ahern wrote:
> On 3/1/22 4:41 AM, Guillaume Nault wrote:
> > And my final point was that the need for ip_rt_fix_tos() is temporary:
> > I plan to do the call paths review anyway, to make them initialise tos
> > and scope properly, thus removing the need for RTO_ONLINK. I already
> > have a draft patch series, but as I said that's work for net-next.
>
> ok, if the cleanup is going to happen in -next then this simple patch
> seems good for net and stable

That's the plan, though I won't have time to submit the cleanup before
the next merge window (as we're already at -rc6 and I still have more
work to do to remove RTO_ONLINK).
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f33ad1f383b6..d5d058de3664 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -499,6 +499,15 @@ void __ip_select_ident(struct net *net, struct iphdr *iph, int segs)
 }
 EXPORT_SYMBOL(__ip_select_ident);
 
+static void ip_rt_fix_tos(struct flowi4 *fl4)
+{
+	__u8 tos = RT_FL_TOS(fl4);
+
+	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
+	fl4->flowi4_scope = tos & RTO_ONLINK ?
+			    RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
+}
+
 static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
 			     const struct sock *sk,
 			     const struct iphdr *iph,
@@ -824,6 +833,7 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf
 	rt = (struct rtable *) dst;
 
 	__build_flow_key(net, &fl4, sk, iph, oif, tos, prot, mark, 0);
+	ip_rt_fix_tos(&fl4);
 	__ip_do_redirect(rt, skb, &fl4, true);
 }
 
@@ -1048,6 +1058,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	struct flowi4 fl4;
 
 	ip_rt_build_flow_key(&fl4, sk, skb);
+	ip_rt_fix_tos(&fl4);
 
 	/* Don't make lookup fail for bridged encapsulations */
 	if (skb && netif_is_any_bridge_port(skb->dev))
@@ -1122,6 +1133,8 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 			goto out;
 
 		new = true;
+	} else {
+		ip_rt_fix_tos(&fl4);
 	}
 
 	__ip_rt_update_pmtu((struct rtable *)xfrm_dst_path(&rt->dst), &fl4, mtu);
@@ -2603,7 +2616,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 					const struct sk_buff *skb)
 {
-	__u8 tos = RT_FL_TOS(fl4);
 	struct fib_result res = {
 		.type = RTN_UNSPEC,
 		.fi = NULL,
@@ -2613,9 +2625,7 @@ struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 	struct rtable *rth;
 
 	fl4->flowi4_iif = LOOPBACK_IFINDEX;
-	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
-	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
-			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
+	ip_rt_fix_tos(fl4);
 
 	rcu_read_lock();
 	rth = ip_route_output_key_hash_rcu(net, fl4, &res, skb);
The PMTU update and ICMP redirect helper functions initialise their fl4
variable with either __build_flow_key() or build_sk_flow_key(). These
initialisation functions always set ->flowi4_scope with
RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
not a problem when the route lookup is later done via
ip_route_output_key_hash(), which properly clears the ECN bits from
->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
flag. However, some helpers call fib_lookup() directly, without
sanitising the tos and scope fields, so the route lookup can fail and,
as a result, the ICMP redirect or PMTU update isn't taken into account.

Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
code into ip_rt_fix_tos(), then use this function in handlers that call
fib_lookup() directly.

Note 1: we can't just let __build_flow_key() set sanitised values for
tos and scope, because other functions use it and pass the flowi4
structure to ip_route_output_key_hash(), which unconditionally resets
the scope to RT_SCOPE_UNIVERSE if it doesn't see the RTO_ONLINK flag in
->flowi4_tos.

Note 2: while a wrongly initialised ->flowi4_tos could interfere with
ICMP redirects and PMTU updates, setting ->flowi4_scope with
RT_SCOPE_UNIVERSE instead of RT_SCOPE_LINK probably wasn't really a
problem: sockets with the SOCK_LOCALROUTE flag set (those that'd result
in RTO_ONLINK being set) normally shouldn't receive redirects and PMTU
updates.

Fixes: d3a25c980fc2 ("ipv4: Fix nexthop exception hash computation.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 net/ipv4/route.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
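To illustrate why leftover ECN bits in ->flowi4_tos can make a direct lookup miss, here is a toy userspace model. It is a sketch under assumptions, not the kernel's FIB code: `toy_route_matches()` and `sanitise_tos()` are hypothetical helpers, the matching rule (a route with a non-zero TOS only matches an identical key TOS, a zero TOS matches anything) is a simplification of how TOS-qualified routes are compared, and the mask value is assumed to mirror the kernel's IPTOS_RT_MASK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel value: routing TOS bits, ECN bits excluded. */
#define IPTOS_RT_MASK 0x1C

/* Simplified matching rule (assumption): a route with a non-zero TOS only
 * matches a lookup key carrying exactly that TOS; TOS 0 acts as a wildcard. */
static bool toy_route_matches(uint8_t route_tos, uint8_t key_tos)
{
	return route_tos == 0 || route_tos == key_tos;
}

/* The tos half of the sanitisation done by ip_rt_fix_tos(): drop ECN and
 * any other bit outside IPTOS_RT_MASK before the key is used for lookup. */
static uint8_t sanitise_tos(uint8_t tos)
{
	return tos & IPTOS_RT_MASK;
}
```

With a route configured for TOS 0x04 and a packet whose TOS byte is 0x06 (0x04 plus an ECN codepoint in the low two bits), `toy_route_matches(0x04, 0x06)` is false, while `toy_route_matches(0x04, sanitise_tos(0x06))` is true. Routes without a TOS qualifier are unaffected, which may explain why the problem only shows up on setups using TOS-based routing.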