mbox series

[net-next,0/4] inet: Separate DSCP from ECN bits using new dscp_t type

Message ID cover.1638814614.git.gnault@redhat.com (mailing list archive)
Headers show
Series inet: Separate DSCP from ECN bits using new dscp_t type | expand

Message

Guillaume Nault Dec. 6, 2021, 6:22 p.m. UTC
Following my talk at LPC 2021 [1], here's a patch series whose
objective is to start fixing the problems with how DSCP and ECN bits
are handled in the kernel. This approach seemed to make consensus among
the participants, although it implies a few behaviour changes for some
corner cases of ip rule and ip route. Let's see if this consensus can
survive a wider review :).

The problem to solve
====================

Currently, the networking stack generally doesn't clearly distinguish
between DSCP and ECN bits. The entire DSCP+ECN bits are stored in a
single u8 variable (or structure field), and each part of the stack
handles them in their own way, using different macros.

This has been, and still is, the source of several bugs:

  * Invalid use of RT_TOS(), leading to regression in VXLAN:
      - a0dced17ad9d ("Revert "vxlan: fix tos value before xmit"")

  * Other invalid uses of RT_TOS() in IPv6 code, where it doesn't make
    sense (RT_TOS() follows the old RFC 1349, which was never meant for
    IPv6):
      - grep for 'ip6_make_flowinfo(RT_TOS(tos)' for several examples
        that need to be fixed.

  * Failure to properly mask the ECN bits before doing IPv4 route
    lookups, leading to returning a wrong route:
      - 2e5a6266fbb1 ("netfilter: rpfilter: mask ecn bits before fib lookup"),
      - 8d2b51b008c2 ("udp: mask TOS bits in udp_v4_early_demux()"),
      - 21fdca22eb7d ("ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()"),
      - 1ebf179037cb ("ipv4: Fix tos mask in inet_rtm_getroute()"),
      - some more, uncommon, code paths still need to be fixed.

Also, this creates inconsistencies between subsystems:

 * IPv4 routes accept tos/dsfield options that have ECN bits set, but
   no packet can actually match them, as their ECN bits are cleared
   before lookup.

 * IPv4 rules accept tos/dsfield options that have the high order ECN
   bit set (but rejects those using the low order ECN bit). Like IPv4
   routes, such rules can't match actual packets as the rule lookup is
   done after clearing the packets ECN bits.

 * IPv6 routes accept the tos/dsfield options without any restriction,
   but silently ignores them.

 * IPv6 rules also accept any value of tos/dsfield, but compares all
   8 bits, including ECN. Therefore, a rule matching packets with a
   particular DSCP value stops working as soon as ECN is used (a work
   around is to define 4 rules for each DSCP value to match, one for
   each possible ECN code point).

 * Modules that want to distinguish between the DSCP and ECN bits
   create their own local macros (Netfilter, SCTP).

What this RFC does
==================

This patch series creates a dscp_t type, meant to represent pure DSCP
values, excluding ECN bits. Doing so allows to clearly separate the
interpretation of DSCP and ECN bits and enables Sparse to detect
invalid combinations of dscp_t with the plain u8 variables generally
used to store the full TOS/Traffic Class/DS field.

Note, this patch series differs slightly from that of the original talk
(slide 14 [2]). For the talk, I just cleared the ECN bits, while in
this series, I do a bit-shift. This way dscp_t really represents DSCP
values, as defined in RFCs. Also I've renamed the helper functions to
replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
"dsfield" makes it clear that dscp_t to u8 conversion isn't just a
plain cast, but that a bit-shift happens and the result has the two ECN
bits.

The new dscp_t type is then used to convert several field members:

  * Patch 1 converts the tclass field of struct fib6_rule. It
    effectively forbids the use of ECN bits in the tos/dsfield option
    of ip -6 rule. Rules now match packets solely based on their DSCP
    bits, so ECN doesn't influence the result anymore. This contrasts
    with previous behaviour where all 8 bits of the Traffic Class field
    was used. It is believed this change is acceptable as matching ECN
    bits wasn't usable for IPv4, so only IPv6-only deployments could be
    depending on it (that is, it's unlikely enough that a someone uses
    ip6 rules matching ECN bits in production).

  * Patch 2 converts the tos field of struct fib4_rule. This one too
    effectively forbids defining ECN bits, this time in ip -4 rule.
    Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
    rejected. But even when accepted, the rule wouldn't match as the
    packets would normally have their ECN bits cleared while doing the
    rule lookup.

  * Patch 3 converts the fc_tos field of struct fib_config. This is
    like patch 2, but for ip4 routes. Routes using a tos/dsfield option
    with any ECN bit set is now rejected. Before this patch, they were
    accepted but, as with ip4 rules, these routes couldn't match any
    real packet, since callers were supposed to clear their ECN bits
    beforehand.

  * Patch 4 converts the fa_tos field of struct fib_alias. This one is
    pure internal u8 to dscp_t conversion. While patches 1-3 dealed
    with user facing consequences, this patch shouldn't have any side
    effect and is just there to give an overview of what such
    conversion patches will look like. These are quite mechanical, but
    imply some code churn.

Note that there's no equivalent of patch 3 for IPv6 (ip route), since
the tos/dsfield option is silently ignored for IPv6 routes.

Future work
===========

This is a minimal series, the objective of this RFC is just to validate
the dscp_t approach. More work will be needed to fully iron out the
problem:

  * Convert more internal structures to dscp_t (in particular the
    flowi4_tos field of struct flowi4).

  * Slowly remove the uses of IPTOS_TOS_MASK, and hence RT_TOS(), in
    the kernel, as it clears only one of the ECN bits and the TOS is
    going to be masked again anyway by IPTOS_RT_MASK, which is
    stricter.

  * Finally, start allowing high DSCP values in IPv4 routes and rules
    (the IPTOS_RT_MASK used in IPv4 clears the 3 high order bits of the
    DS field).

  * Maybe find a way to warn users that use the ignored tos/dsfield
    option with ip -6 route.

Feedbacks welcome!

Thanks,

William

[1] LPC talk: https://linuxplumbersconf.org/event/11/contributions/943/
[2] LPC slides: https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf


Guillaume Nault (4):
  ipv6: Define dscp_t and stop taking ECN bits into account in ip6-rules
  ipv4: Stop taking ECN bits into account in ip4-rules
  ipv4: Reject routes specifying ECN bits in rtm_tos
  ipv4: Use dscp_t in struct fib_alias

 include/net/inet_dscp.h  | 54 +++++++++++++++++++++++++++++++++++++
 include/net/ip_fib.h     |  3 ++-
 include/net/ipv6.h       |  6 +++++
 net/ipv4/fib_frontend.c  | 10 ++++++-
 net/ipv4/fib_lookup.h    |  3 ++-
 net/ipv4/fib_rules.c     | 17 ++++++------
 net/ipv4/fib_semantics.c | 14 +++++-----
 net/ipv4/fib_trie.c      | 58 +++++++++++++++++++++++-----------------
 net/ipv4/route.c         |  3 ++-
 net/ipv6/fib6_rules.c    | 18 ++++++++-----
 10 files changed, 138 insertions(+), 48 deletions(-)
 create mode 100644 include/net/inet_dscp.h

Comments

Guillaume Nault Dec. 6, 2021, 7:57 p.m. UTC | #1
On Mon, Dec 06, 2021 at 07:22:02PM +0100, Guillaume Nault wrote:
> Following my talk at LPC 2021 [1], here's a patch series whose
> objective is to start fixing the problems with how DSCP and ECN bits
> are handled in the kernel.

This is of course an RFC patch series, not meant for merging. Sorry for
the missing RFC tag in the subject (used the wrong script :/).
Toke Høiland-Jørgensen Dec. 14, 2021, 12:16 a.m. UTC | #2
Guillaume Nault <gnault@redhat.com> writes:

> Following my talk at LPC 2021 [1], here's a patch series whose
> objective is to start fixing the problems with how DSCP and ECN bits
> are handled in the kernel. This approach seemed to make consensus among
> the participants, although it implies a few behaviour changes for some
> corner cases of ip rule and ip route. Let's see if this consensus can
> survive a wider review :).

I like the approach, although I must admit to not being too familiar
with the parts of the code you're touching in this series. But I think
the typedefs make sense, and I (still) think it's a good idea to do the
conversion. I think the main thing to ensure from a backwards
compatibility PoV is that we don't silently change behaviour in a way
that is hard to detect. I.e., rejecting invalid configuration is fine
even if it was "allowed" before, but, say, changing the matching
behaviour so an existing rule set will still run unchanged but behave
differently is best avoided.

> Note, this patch series differs slightly from that of the original talk
> (slide 14 [2]). For the talk, I just cleared the ECN bits, while in
> this series, I do a bit-shift. This way dscp_t really represents DSCP
> values, as defined in RFCs. Also I've renamed the helper functions to
> replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
> "dsfield" makes it clear that dscp_t to u8 conversion isn't just a
> plain cast, but that a bit-shift happens and the result has the two ECN
> bits.

I like the names, but why do the bitshift? I get that it's conceptually
"cleaner", but AFAICT the shifted values are not actually used for
anything other than being shifted back again? In which case you're just
adding operations in the fast path for no reason...

> The new dscp_t type is then used to convert several field members:
>
>   * Patch 1 converts the tclass field of struct fib6_rule. It
>     effectively forbids the use of ECN bits in the tos/dsfield option
>     of ip -6 rule. Rules now match packets solely based on their DSCP
>     bits, so ECN doesn't influence the result anymore. This contrasts
>     with previous behaviour where all 8 bits of the Traffic Class field
>     was used. It is believed this change is acceptable as matching ECN
>     bits wasn't usable for IPv4, so only IPv6-only deployments could be
>     depending on it (that is, it's unlikely enough that a someone uses
>     ip6 rules matching ECN bits in production).

I think this is OK, cf the "break explicitly" thing I wrote above.

>   * Patch 2 converts the tos field of struct fib4_rule. This one too
>     effectively forbids defining ECN bits, this time in ip -4 rule.
>     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
>     rejected. But even when accepted, the rule wouldn't match as the
>     packets would normally have their ECN bits cleared while doing the
>     rule lookup.

As above.

>   * Patch 3 converts the fc_tos field of struct fib_config. This is
>     like patch 2, but for ip4 routes. Routes using a tos/dsfield option
>     with any ECN bit set is now rejected. Before this patch, they were
>     accepted but, as with ip4 rules, these routes couldn't match any
>     real packet, since callers were supposed to clear their ECN bits
>     beforehand.

Didn't work at all, so also fine.

>   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
>     pure internal u8 to dscp_t conversion. While patches 1-3 dealed
>     with user facing consequences, this patch shouldn't have any side
>     effect and is just there to give an overview of what such
>     conversion patches will look like. These are quite mechanical, but
>     imply some code churn.

This is reasonable, and I think the code churn is worth the extra
clarity. You should probably spell out in the commit message that it's
not intended to change behaviour, though.

> Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> the tos/dsfield option is silently ignored for IPv6 routes.

Shouldn't we just start rejecting them, like for v4?

-Toke
Guillaume Nault Dec. 15, 2021, 4:48 p.m. UTC | #3
On Tue, Dec 14, 2021 at 01:16:43AM +0100, Toke Høiland-Jørgensen wrote:
> Guillaume Nault <gnault@redhat.com> writes:
> 
> > Following my talk at LPC 2021 [1], here's a patch series whose
> > objective is to start fixing the problems with how DSCP and ECN bits
> > are handled in the kernel. This approach seemed to make consensus among
> > the participants, although it implies a few behaviour changes for some
> > corner cases of ip rule and ip route. Let's see if this consensus can
> > survive a wider review :).
> 
> I like the approach, although I must admit to not being too familiar
> with the parts of the code you're touching in this series. But I think
> the typedefs make sense, and I (still) think it's a good idea to do the
> conversion. I think the main thing to ensure from a backwards
> compatibility PoV is that we don't silently change behaviour in a way
> that is hard to detect. I.e., rejecting invalid configuration is fine
> even if it was "allowed" before, but, say, changing the matching
> behaviour so an existing rule set will still run unchanged but behave
> differently is best avoided.
> 
> > Note, this patch series differs slightly from that of the original talk
> > (slide 14 [2]). For the talk, I just cleared the ECN bits, while in
> > this series, I do a bit-shift. This way dscp_t really represents DSCP
> > values, as defined in RFCs. Also I've renamed the helper functions to
> > replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
> > "dsfield" makes it clear that dscp_t to u8 conversion isn't just a
> > plain cast, but that a bit-shift happens and the result has the two ECN
> > bits.
> 
> I like the names, but why do the bitshift? I get that it's conceptually
> "cleaner", but AFAICT the shifted values are not actually used for
> anything other than being shifted back again? In which case you're just
> adding operations in the fast path for no reason...

That's right, the value is always shifted back again because all
current APIs work with the full dsfield (well all the ones I'm aware of
at least).

I switched to a bit shift when I tried writing down what dscp_t
was representing: I found it a bit clumsy to explain that it actually
wasn't exactly the DSCP. Also I didn't expect the bit shift to have any
mesurable impact.

Anyway, I don't mind reverting to a simple bit mask.

> > The new dscp_t type is then used to convert several field members:
> >
> >   * Patch 1 converts the tclass field of struct fib6_rule. It
> >     effectively forbids the use of ECN bits in the tos/dsfield option
> >     of ip -6 rule. Rules now match packets solely based on their DSCP
> >     bits, so ECN doesn't influence the result anymore. This contrasts
> >     with previous behaviour where all 8 bits of the Traffic Class field
> >     was used. It is believed this change is acceptable as matching ECN
> >     bits wasn't usable for IPv4, so only IPv6-only deployments could be
> >     depending on it (that is, it's unlikely enough that a someone uses
> >     ip6 rules matching ECN bits in production).
> 
> I think this is OK, cf the "break explicitly" thing I wrote above.
> 
> >   * Patch 2 converts the tos field of struct fib4_rule. This one too
> >     effectively forbids defining ECN bits, this time in ip -4 rule.
> >     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
> >     rejected. But even when accepted, the rule wouldn't match as the
> >     packets would normally have their ECN bits cleared while doing the
> >     rule lookup.
> 
> As above.
> 
> >   * Patch 3 converts the fc_tos field of struct fib_config. This is
> >     like patch 2, but for ip4 routes. Routes using a tos/dsfield option
> >     with any ECN bit set is now rejected. Before this patch, they were
> >     accepted but, as with ip4 rules, these routes couldn't match any
> >     real packet, since callers were supposed to clear their ECN bits
> >     beforehand.
> 
> Didn't work at all, so also fine.
> 
> >   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
> >     pure internal u8 to dscp_t conversion. While patches 1-3 dealed
> >     with user facing consequences, this patch shouldn't have any side
> >     effect and is just there to give an overview of what such
> >     conversion patches will look like. These are quite mechanical, but
> >     imply some code churn.
> 
> This is reasonable, and I think the code churn is worth the extra
> clarity.

Thanks for these feedbacks. These are the main points I wanted to
discuss with this RFC.

> You should probably spell out in the commit message that it's
> not intended to change behaviour, though.

Will do.

> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> > the tos/dsfield option is silently ignored for IPv6 routes.
> 
> Shouldn't we just start rejecting them, like for v4?

I had some thoughs about that, but didn't talk about them in the cover
letter since I felt there was already enough edge cases to discuss, and
this one wasn't directly related to this series (the problem is there
regardless of this RFC).

So, on the one hand, we have this old policy of ignoring unknown
netlink attributes, so it looks consistent to also ignore unused
structure fields.

On the other hand, ignoring rtm_tos leads to a different behaviour than
what was requested. So it certainly makes sense to at least warn the
user. But a hard fail may break existing programs that don't clear
rtm_tos by mistake.

I'm not too sure which approach is better.

> -Toke
>
Dave Taht Dec. 15, 2021, 8:40 p.m. UTC | #4
I look forward to a final toss of TOS.

Cheered-on-by: Dave Taht <dave.taht@gmail.com>
Toke Høiland-Jørgensen Dec. 17, 2021, 5:55 p.m. UTC | #5
>> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
>> > the tos/dsfield option is silently ignored for IPv6 routes.
>> 
>> Shouldn't we just start rejecting them, like for v4?
>
> I had some thoughs about that, but didn't talk about them in the cover
> letter since I felt there was already enough edge cases to discuss, and
> this one wasn't directly related to this series (the problem is there
> regardless of this RFC).
>
> So, on the one hand, we have this old policy of ignoring unknown
> netlink attributes, so it looks consistent to also ignore unused
> structure fields.
>
> On the other hand, ignoring rtm_tos leads to a different behaviour than
> what was requested. So it certainly makes sense to at least warn the
> user. But a hard fail may break existing programs that don't clear
> rtm_tos by mistake.
>
> I'm not too sure which approach is better.

So I guess you could argue that those applications were broken in the
first place, and so an explicit reject would only expose this? Do you
know of any applications that actually *function* while doing what you
describe?

One thought could be to add the rejection but be prepared to back it out
if it does turn out (during the -rc phase) that it breaks something?

-Toke
Guillaume Nault Dec. 17, 2021, 10:52 p.m. UTC | #6
On Fri, Dec 17, 2021 at 06:55:43PM +0100, Toke Høiland-Jørgensen wrote:
> >> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> >> > the tos/dsfield option is silently ignored for IPv6 routes.
> >> 
> >> Shouldn't we just start rejecting them, like for v4?
> >
> > I had some thoughs about that, but didn't talk about them in the cover
> > letter since I felt there was already enough edge cases to discuss, and
> > this one wasn't directly related to this series (the problem is there
> > regardless of this RFC).
> >
> > So, on the one hand, we have this old policy of ignoring unknown
> > netlink attributes, so it looks consistent to also ignore unused
> > structure fields.
> >
> > On the other hand, ignoring rtm_tos leads to a different behaviour than
> > what was requested. So it certainly makes sense to at least warn the
> > user. But a hard fail may break existing programs that don't clear
> > rtm_tos by mistake.
> >
> > I'm not too sure which approach is better.
> 
> So I guess you could argue that those applications were broken in the
> first place, and so an explicit reject would only expose this? Do you
> know of any applications that actually *function* while doing what you
> describe?

I don't know of any existing application that actually does. But it's
easy to imagine a developer setting only parts of the rtmsg structure
and leaving the rest uninitialised. Exposing the problem might not help
the end user, who may have no way to modify the broken program.

Also, for people using ifupdown (/etc/network/interfaces on Debian and
derivatives), rejecting a command can cancel the configuration of an
entire device section. So a stray tos option on an ip -6 route command
would now leave the network interface entirely unconfigured.

I'm not saying these situations exist, just trying to anticipate all
possible side effects.

> One thought could be to add the rejection but be prepared to back it out
> if it does turn out (during the -rc phase) that it breaks something?

Given that it's something that'd be easy to revert, maybe we can try
this approach.

> -Toke
>