[net-next,00/14] nexthop: Resilient next-hop groups

Message ID cover.1615387786.git.petrm@nvidia.com (mailing list archive)

Petr Machata March 10, 2021, 3:02 p.m. UTC
At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
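
The effect can be reproduced with a minimal sketch. It assumes equal
weights and a toy hash space of 100 values instead of the real SKB hash;
the function name is made up for illustration and is not kernel code.

```python
def hash_threshold_select(next_hops, flow_hash, hash_space):
    # With equal weights, each next hop owns one contiguous, equally sized
    # range of the hash space; the range index picks the next hop.
    return next_hops[flow_hash * len(next_hops) // hash_space]


HASH_SPACE = 100            # toy value, chosen so the ranges are easy to read
before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]        # next hop 3 deleted

# Hashes whose next hop changed even though their old next hop survived.
moved = [
    h for h in range(HASH_SPACE)
    if hash_threshold_select(before, h, HASH_SPACE) != 3
    and hash_threshold_select(before, h, HASH_SPACE)
    != hash_threshold_select(after, h, HASH_SPACE)
]
```

In this toy setup, hashes 20-24 move from next hop 2 to next hop 1 and
hashes 75-79 move from next hop 4 to next hop 5, which is the shift shown
in the diagram.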

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly arrive at a server that does not expect them, which may lead to a
TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problem. A resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation on the SKB hash to choose a hash table
bucket, then reads the next hop that this bucket contains, and forwards
traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore, when a next hop is deleted,
the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
keep being forwarded to the same endpoints through the same paths as before
the next-hop group change.

This patch set adds the implementation of resilient next-hop groups.

In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might change the bucket
allocation, the numbers for individual next hops are updated, similarly to
how ranges are updated for mpath group next hops. Following that, a new
"upkeep" algorithm runs. It looks for idle buckets that belong to a next
hop currently occupying more buckets than it wants (it is "overweight"),
and migrates them to one of the next hops that has fewer buckets than it
wants (it is "underweight"). If, after this, there are still underweight
next hops, another upkeep run is scheduled for a future time.

Chances are there are not enough "idle" buckets to satisfy the new demands.
The algorithm has knobs to select both what it means for a bucket to be
idle, and whether and when to forcefully migrate buckets if the number of
idle ones remains insufficient.
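
The upkeep pass described above can be sketched roughly as follows. This is
a simplified model, not the kernel code: the function names, the rounding
used to derive the wanted counts, and the choice of the first underweight
next hop as the migration target are all assumptions of the sketch.

```python
def wanted_buckets(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights."""
    total = sum(weights.values())
    wants, acc, prev_upper = {}, 0, 0
    for nh, weight in weights.items():
        acc += weight
        # Round the cumulative share to the nearest bucket so that the
        # per-next-hop counts always add up to num_buckets.
        upper = (num_buckets * acc + total // 2) // total
        wants[nh] = upper - prev_upper
        prev_upper = upper
    return wants


def upkeep(buckets, idle, wants):
    """Move idle buckets from overweight to underweight next hops, in place.

    Returns True if some next hop is still underweight afterwards, i.e. if
    another upkeep run would have to be scheduled.
    """
    have = {nh: 0 for nh in wants}
    for nh in buckets:
        have[nh] = have.get(nh, 0) + 1
    for i, nh in enumerate(buckets):
        overweight = have[nh] > wants.get(nh, 0)
        if not (idle[i] and overweight):
            continue  # busy buckets and buckets of satisfied next hops stay
        underweight = [n for n in wants if have[n] < wants[n]]
        if not underweight:
            break
        target = underweight[0]
        buckets[i] = target
        have[nh] -= 1
        have[target] += 1
    return any(have[n] < wants[n] for n in wants)


wants = wanted_buckets({1: 3, 2: 1}, 8)          # hypothetical weights 3:1
table = [1, 1, 1, 1, 2, 2, 2, 2]
idle = [False, False, False, False, True, False, False, False]
still_unbalanced = upkeep(table, idle, wants)
```

With only one idle bucket available, next hop 1 ends up with five of the six
buckets it wants, so the pass reports that the table is still unbalanced and
a later run has to pick up the remaining bucket once it goes idle.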

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
	buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

If not present in the netlink message, the idle timer defaults to 120
seconds, and there is no unbalanced timer, meaning the group may remain
unbalanced indefinitely. The value of 120 is the default in the Cumulus
implementation of resilient next-hop groups. To a degree the default is
arbitrary; the only value that certainly does not make sense is 0.
Therefore going with an existing deployed implementation is reasonable.
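
How the two timers interact can be summarized with a small sketch. It
assumes the bucket in question already belongs to an overweight next hop,
and it models "no unbalanced timer" as None; the function and parameter
names are illustrative only.

```python
def may_migrate(idle_time, unbalanced_time,
                idle_timer=120, unbalanced_timer=None):
    """Decide whether a bucket of an overweight next hop may be migrated.

    All values are in seconds. The defaults mirror the ones described
    above: a 120 second idle timer and no unbalanced timer.
    """
    if idle_time >= idle_timer:
        return True  # no traffic for long enough: the bucket is idle
    # A busy bucket is only moved once the table has been out of balance
    # for longer than the (optional) unbalanced timer allows.
    return unbalanced_timer is not None and unbalanced_time >= unbalanced_timer
```

With the settings from the example group (idle_timer 60, unbalanced_timer
300), a bucket that saw traffic 10 seconds ago stays put until the group
has been unbalanced for 300 seconds. With the defaults, a busy bucket is
never moved by force.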

Unbalanced time, i.e. how long since the last time that all nexthops had as
many buckets as they should according to their weights, is reported when
the group is dumped:

 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

When replacing next hops or changing weights, if one does not specify some
parameters, their value is left as it was:

 # ip nexthop replace id 10 group 1,2/2 type resilient
 # ip nexthop show id 10
 id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

It is also possible to do a dump of individual buckets (and now you know
why there were only 8 of them in the example above):

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the nexthop replace command to satisfy the new demand
that nexthop 1 be given 6 buckets instead of 4.
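
For scripting, the bucket dump is easy to post-process. The sketch below
parses only the exact line format shown above; real output may carry
additional fields or flags, which this parser does not handle.

```python
from collections import Counter

DUMP = """\
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
"""


def parse_bucket_dump(text):
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Each line is a flat sequence of "key value" pairs.
        kv = dict(zip(fields[0::2], fields[1::2]))
        rows.append({
            "index": int(kv["index"]),
            "idle_time": float(kv["idle_time"]),
            "nhid": int(kv["nhid"]),
        })
    return rows


rows = parse_bucket_dump(DUMP)
per_nexthop = Counter(row["nhid"] for row in rows)
shortest = min(row["idle_time"] for row in rows)
recently_migrated = [r["index"] for r in rows if r["idle_time"] == shortest]
```

For the dump above this counts six buckets for nexthop 1 and two for
nexthop 2, and singles out buckets 0 and 1 as the ones with the shorter
idle time.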

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
  meant to be set for all nexthop groups that in general have several
  nexthops from which they choose, and avoids a more expensive dispatch
  based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop
  group type. At this point, the nexthop code is made to bounce the new
  type. As the resilient hashing code is gradually added in the following
  patches, it will remain dead. The last patch will make it accessible.

  This patch also adds a suite of new messages related to next hop buckets.
  This approach was taken instead of overloading the information on the
  existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

  First, a next-hop group can contain a large number of next-hop buckets
  (4k is not unheard of). This imposes limits on the amount of information
  that can be encoded for each next-hop bucket, given that a netlink message
  is limited to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
  point, in the future it can be extended to provide user space with
  control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the
  drivers.

- Patch #8 adds an interface for the drivers to report resilient hash
  table bucket activity. Drivers will be able to report through this
  interface whether traffic is hitting a given bucket.

- Patch #9 adds an interface for the drivers to report whether a given
  hash table bucket is offloaded or trapping traffic.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
  the code necessary for creation of resilient groups, bucket dumping and
  getting, and bucket migration notifications.

- In patch #14 the resilient next-hop groups are finally made available.
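
To put a number on the message size argument made for patch #4: the
back-of-the-envelope calculation below ignores netlink header and attribute
overhead, so the real per-bucket budget would be smaller still.

```python
NETLINK_MSG_LIMIT = 64 * 1024   # bytes, the limit quoted for patch #4
NUM_BUCKETS = 4 * 1024          # the "4k is not unheard of" case

# Upper bound on the bytes available per bucket if all buckets had to fit
# into a single message.
bytes_per_bucket = NETLINK_MSG_LIMIT // NUM_BUCKETS
```

That leaves at most 16 bytes per bucket, which is little room for
per-bucket attributes in a single message.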

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1

v1 (changes since RFC):
- Patch #3:
    - This patch is new
- Patches #4-#13:
    - u32 -> u16 for bucket counts / indices
- Patch #5:
    - set the new flag is_multipath for resilient groups

Ido Schimmel (4):
  nexthop: Add netlink defines and enumerators for resilient NH groups
  nexthop: Add data structures for resilient group notifications
  nexthop: Allow setting "offload" and "trap" indication of nexthop
    buckets
  nexthop: Allow reporting activity of nexthop buckets

Petr Machata (10):
  nexthop: Pass nh_config to replace_nexthop()
  nexthop: __nh_notifier_single_info_init(): Make nh_info an argument
  nexthop: Add a dedicated flag for multipath next-hop groups
  nexthop: Add implementation of resilient next-hop groups
  nexthop: Implement notifiers for resilient nexthop groups
  nexthop: Add netlink handlers for resilient nexthop groups
  nexthop: Add netlink handlers for bucket dump
  nexthop: Add netlink handlers for bucket get
  nexthop: Notify userspace about bucket migrations
  nexthop: Enable resilient next-hop groups

 include/net/nexthop.h          |   72 +-
 include/uapi/linux/nexthop.h   |   43 +
 include/uapi/linux/rtnetlink.h |    7 +
 net/ipv4/nexthop.c             | 1524 ++++++++++++++++++++++++++++++--
 security/selinux/nlmsgtab.c    |    5 +-
 5 files changed, 1597 insertions(+), 54 deletions(-)

Comments

David Ahern March 11, 2021, 4:22 p.m. UTC | #1
On 3/10/21 8:02 AM, Petr Machata wrote:
> [...]

Well done and well documented. Thanks for the attention to detail there.

When you get to the end of the sets, it would be good to submit
documentation for resilient multipath under Documentation/networking
Petr Machata March 11, 2021, 5:18 p.m. UTC | #2
David Ahern <dsahern@gmail.com> writes:

> When you get to the end of the sets, it would be good to submit
> documentation for resilient multipath under Documentation/networking

All right.